public class ParquetColumnarRowInputFormat<SplitT extends FileSourceSplit> extends ParquetVectorizedInputFormat<RowData,SplitT> implements FileBasedStatisticsReportableInputFormat

A ParquetVectorizedInputFormat that provides a RowData iterator, using ColumnarRowData to provide a row view of the column batch.

Nested classes inherited from class ParquetVectorizedInputFormat:
ParquetVectorizedInputFormat.ParquetReaderBatch<T>

Nested classes inherited from interface BulkFormat:
BulkFormat.Reader<T>, BulkFormat.RecordIterator<T>

Fields inherited from class ParquetVectorizedInputFormat:
hadoopConfig, isUtcTimestamp
Constructor and Description

ParquetColumnarRowInputFormat(org.apache.hadoop.conf.Configuration hadoopConfig,
                              RowType projectedType,
                              TypeInformation<RowData> producedTypeInfo,
                              int batchSize,
                              boolean isUtcTimestamp,
                              boolean isCaseSensitive)
    Constructor to create parquet format without extra fields.
Modifier and Type / Method and Description

static <SplitT extends FileSourceSplit> ParquetColumnarRowInputFormat<SplitT>
    createPartitionedFormat(org.apache.hadoop.conf.Configuration hadoopConfig,
                            RowType producedRowType,
                            TypeInformation<RowData> producedTypeInfo,
                            List<String> partitionKeys,
                            PartitionFieldExtractor<SplitT> extractor,
                            int batchSize,
                            boolean isUtcTimestamp,
                            boolean isCaseSensitive)
    Create a partitioned ParquetColumnarRowInputFormat, where the partition columns can be generated by Path.

protected ParquetVectorizedInputFormat.ParquetReaderBatch<RowData>
    createReaderBatch(WritableColumnVector[] writableVectors,
                      VectorizedColumnBatch columnarBatch,
                      Pool.Recycler<ParquetVectorizedInputFormat.ParquetReaderBatch<RowData>> recycler)

TypeInformation<RowData>
    getProducedType()
    Gets the type produced by this format.

protected int
    numBatchesToCirculate(Configuration config)

TableStats
    reportStatistics(List<Path> files, DataType producedDataType)
    Returns the estimated statistics of this input format.

Methods inherited from class ParquetVectorizedInputFormat:
createReader, isSplittable, restoreReader
public ParquetColumnarRowInputFormat(org.apache.hadoop.conf.Configuration hadoopConfig, RowType projectedType, TypeInformation<RowData> producedTypeInfo, int batchSize, boolean isUtcTimestamp, boolean isCaseSensitive)
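As an illustrative sketch (not part of this page), the constructor above and the createPartitionedFormat factory might be wired up as follows. The schema, field names, batch size, and the helpers InternalTypeInfo.of and PartitionFieldExtractor.forFileSystem are assumptions about the surrounding Flink table/connector APIs, not guarantees from this reference:

```java
// Hedged sketch: build the format for an assumed two-column schema.
RowType projectedType =
        RowType.of(
                new LogicalType[] {new IntType(), new VarCharType(VarCharType.MAX_LENGTH)},
                new String[] {"id", "name"}); // field names are illustrative

ParquetColumnarRowInputFormat<FileSourceSplit> format =
        new ParquetColumnarRowInputFormat<>(
                new org.apache.hadoop.conf.Configuration(), // hadoopConfig
                projectedType,                              // projectedType
                InternalTypeInfo.of(projectedType),         // producedTypeInfo (assumed helper)
                2048,                                       // batchSize (illustrative)
                true,                                       // isUtcTimestamp
                true);                                      // isCaseSensitive

// For Hive-style partitioned directories, the static factory instead derives
// partition columns from each file's Path. The produced type here includes
// the partition column "dt" (an assumed partition key).
RowType producedRowType =
        RowType.of(
                new LogicalType[] {
                    new IntType(), new VarCharType(VarCharType.MAX_LENGTH), new VarCharType(VarCharType.MAX_LENGTH)
                },
                new String[] {"id", "name", "dt"});

ParquetColumnarRowInputFormat<FileSourceSplit> partitionedFormat =
        ParquetColumnarRowInputFormat.createPartitionedFormat(
                new org.apache.hadoop.conf.Configuration(),
                producedRowType,
                InternalTypeInfo.of(producedRowType),
                java.util.Arrays.asList("dt"),                              // partitionKeys (assumed)
                PartitionFieldExtractor.forFileSystem("__DEFAULT_PARTITION__"), // assumed extractor
                2048,
                true,
                true);
```

Either format can then be handed to a bulk-format file source (for example via FileSource.forBulkFileFormat) to read Parquet files as RowData.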
protected int numBatchesToCirculate(Configuration config)

Overrides:
    numBatchesToCirculate in class ParquetVectorizedInputFormat<RowData,SplitT extends FileSourceSplit>
protected ParquetVectorizedInputFormat.ParquetReaderBatch<RowData> createReaderBatch(WritableColumnVector[] writableVectors, VectorizedColumnBatch columnarBatch, Pool.Recycler<ParquetVectorizedInputFormat.ParquetReaderBatch<RowData>> recycler)

Specified by:
    createReaderBatch in class ParquetVectorizedInputFormat<RowData,SplitT extends FileSourceSplit>
Parameters:
    writableVectors - vectors to be written
    columnarBatch - vectors to be read
    recycler - batch recycler

public TypeInformation<RowData> getProducedType()

Description copied from interface: BulkFormat
    Gets the type produced by this format.
Specified by:
    getProducedType in interface ResultTypeQueryable<RowData>
Specified by:
    getProducedType in interface BulkFormat<RowData,SplitT extends FileSourceSplit>
public TableStats reportStatistics(List<Path> files, DataType producedDataType)

Description copied from interface: FileBasedStatisticsReportableInputFormat
    Returns the estimated statistics of this input format.
Specified by:
    reportStatistics in interface FileBasedStatisticsReportableInputFormat
Parameters:
    files - The files to be estimated.
    producedDataType - the final output type of the format.

public static <SplitT extends FileSourceSplit> ParquetColumnarRowInputFormat<SplitT> createPartitionedFormat(org.apache.hadoop.conf.Configuration hadoopConfig, RowType producedRowType, TypeInformation<RowData> producedTypeInfo, List<String> partitionKeys, PartitionFieldExtractor<SplitT> extractor, int batchSize, boolean isUtcTimestamp, boolean isCaseSensitive)
Create a partitioned ParquetColumnarRowInputFormat, where the partition columns can be generated by Path.

Copyright © 2014–2024 The Apache Software Foundation. All rights reserved.