@PublicEvolving public interface FileBasedStatisticsReportableInputFormat
This interface is used by file-based connectors that also implement SupportsStatisticReport.
Files come in different formats, and each format has its own way
of storing and obtaining statistics. For example, Parquet and ORC both
store metadata in the file footer, including row count, max/min values, null
count, etc. CSV, by contrast, carries no metadata other than the file size, so one
approach to estimating the row count is to divide the total file size by the average length of the
sampled rows.
Note: This method is called during the plan optimization phase, so implementations of this interface should be as lightweight as possible while still reporting information that is as complete as possible.
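The CSV estimate described above (file size divided by the average length of the sampled rows) can be sketched in plain Java. This is only an illustration under stated assumptions, not Flink's implementation: the class name `CsvRowCountEstimator`, the method `estimateRowCount`, and the `maxSampleRows` parameter are hypothetical, the file is read through `java.nio.file.Path` rather than Flink's own `Path` type, and the file is assumed to be UTF-8 with `\n` line endings and no header row.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper, not part of Flink: estimates CSV row count by sampling. */
public class CsvRowCountEstimator {

    /**
     * Estimates the number of rows as fileSize / averageSampledRowLength.
     * Only the first maxSampleRows lines are read, which keeps the call cheap.
     */
    public static long estimateRowCount(Path file, int maxSampleRows) throws IOException {
        long fileSize = Files.size(file);
        long sampledBytes = 0;
        int sampledRows = 0;

        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while (sampledRows < maxSampleRows && (line = reader.readLine()) != null) {
                // Assumption: each row ends with a single '\n', hence the + 1.
                sampledBytes += line.getBytes(StandardCharsets.UTF_8).length + 1;
                sampledRows++;
            }
        }

        if (sampledRows == 0) {
            // Empty file: nothing to sample, so report zero rows.
            return 0L;
        }
        double averageRowLength = (double) sampledBytes / sampledRows;
        return Math.round(fileSize / averageRowLength);
    }
}
```

Because only a bounded number of rows is read, the cost stays roughly constant regardless of file size, which fits the requirement that the method be light during plan optimization. The estimate degrades when row lengths vary widely across the file, since the sample is taken from the beginning only.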
Modifier and Type | Method and Description |
---|---|
TableStats | reportStatistics(List<Path> files, DataType producedDataType) Returns the estimated statistics of this input format. |
TableStats reportStatistics(List<Path> files, DataType producedDataType)
files - The files to be estimated.
producedDataType - The final output type of the format.
Copyright © 2014–2024 The Apache Software Foundation. All rights reserved.