public class ParquetColumnarRowInputFormat<SplitT extends FileSourceSplit>
extends ParquetVectorizedInputFormat<RowData,SplitT>

A ParquetVectorizedInputFormat that provides a RowData iterator, using ColumnarRowData to give a row view of the column batch.

Nested classes/interfaces inherited from class ParquetVectorizedInputFormat: ParquetVectorizedInputFormat.ParquetReaderBatch<T>
Nested classes/interfaces inherited from interface BulkFormat: BulkFormat.Reader<T>, BulkFormat.RecordIterator<T>
| Constructor and Description |
| --- |
| ParquetColumnarRowInputFormat(Configuration hadoopConfig, RowType projectedType, int batchSize, boolean isUtcTimestamp, boolean isCaseSensitive): Constructor to create a Parquet format without extra fields. |
| ParquetColumnarRowInputFormat(Configuration hadoopConfig, RowType projectedType, RowType producedType, ColumnBatchFactory<SplitT> batchFactory, int batchSize, boolean isUtcTimestamp, boolean isCaseSensitive): Constructor to create a Parquet format with extra fields created by ColumnBatchFactory. |
| Modifier and Type | Method and Description |
| --- | --- |
| static <SplitT extends FileSourceSplit> ParquetColumnarRowInputFormat<SplitT> | createPartitionedFormat(Configuration hadoopConfig, RowType producedRowType, List<String> partitionKeys, PartitionFieldExtractor<SplitT> extractor, int batchSize, boolean isUtcTimestamp, boolean isCaseSensitive): Creates a partitioned ParquetColumnarRowInputFormat whose partition columns can be generated from the split Path. |
| protected ParquetVectorizedInputFormat.ParquetReaderBatch<RowData> | createReaderBatch(WritableColumnVector[] writableVectors, VectorizedColumnBatch columnarBatch, Pool.Recycler<ParquetVectorizedInputFormat.ParquetReaderBatch<RowData>> recycler) |
| TypeInformation<RowData> | getProducedType(): Gets the type produced by this format. |
| protected int | numBatchesToCirculate(Configuration config) |

Methods inherited from class ParquetVectorizedInputFormat: createReader, isSplittable, restoreReader
public ParquetColumnarRowInputFormat(Configuration hadoopConfig, RowType projectedType, int batchSize, boolean isUtcTimestamp, boolean isCaseSensitive)

Constructor to create a Parquet format without extra fields.
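As a hedged sketch of how this constructor might be wired into Flink's FileSource API (the field names, file path, and batch size below are illustrative assumptions, not part of this page):

```java
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.FileSourceSplit;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.ParquetColumnarRowInputFormat;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.BigIntType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.RowType;
import org.apache.flink.table.types.logical.VarCharType;
import org.apache.hadoop.conf.Configuration;

public class ParquetReadSketch {

    static ParquetColumnarRowInputFormat<FileSourceSplit> buildFormat() {
        // Projected row type: the subset of file columns to read (no extra fields).
        // The columns (id BIGINT, name STRING) are hypothetical.
        RowType projectedType = RowType.of(
                new LogicalType[] {new BigIntType(), new VarCharType(VarCharType.MAX_LENGTH)},
                new String[] {"id", "name"});

        return new ParquetColumnarRowInputFormat<>(
                new Configuration(), // Hadoop configuration, e.g. filesystem settings
                projectedType,
                500,   // rows per vectorized column batch
                true,  // isUtcTimestamp: decode timestamps as UTC
                true); // isCaseSensitive: match column names case-sensitively
    }

    public static void main(String[] args) {
        // Hypothetical path; the format plugs into a FileSource as a bulk format.
        FileSource<RowData> source = FileSource
                .forBulkFileFormat(buildFormat(), new Path("/path/to/warehouse/data.parquet"))
                .build();
    }
}
```

Constructing the format does not open any file; reading happens when the source's readers call createReader/restoreReader on their splits.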
public ParquetColumnarRowInputFormat(Configuration hadoopConfig, RowType projectedType, RowType producedType, ColumnBatchFactory<SplitT> batchFactory, int batchSize, boolean isUtcTimestamp, boolean isCaseSensitive)

Constructor to create a Parquet format with extra fields created by ColumnBatchFactory.

Parameters:
projectedType - the projected row type for the Parquet format; excludes extra fields
producedType - the row type produced by this input format; includes extra fields
batchFactory - factory for creating the column batch; can add in extra fields

protected int numBatchesToCirculate(Configuration config)

Overrides:
numBatchesToCirculate in class ParquetVectorizedInputFormat<RowData,SplitT extends FileSourceSplit>
protected ParquetVectorizedInputFormat.ParquetReaderBatch<RowData> createReaderBatch(WritableColumnVector[] writableVectors, VectorizedColumnBatch columnarBatch, Pool.Recycler<ParquetVectorizedInputFormat.ParquetReaderBatch<RowData>> recycler)

Specified by:
createReaderBatch in class ParquetVectorizedInputFormat<RowData,SplitT extends FileSourceSplit>

Parameters:
writableVectors - vectors to be written
columnarBatch - vectors to be read
recycler - batch recycler

public TypeInformation<RowData> getProducedType()

Gets the type produced by this format.

Specified by:
getProducedType in interface BulkFormat<RowData,SplitT extends FileSourceSplit>
public static <SplitT extends FileSourceSplit> ParquetColumnarRowInputFormat<SplitT> createPartitionedFormat(Configuration hadoopConfig, RowType producedRowType, List<String> partitionKeys, PartitionFieldExtractor<SplitT> extractor, int batchSize, boolean isUtcTimestamp, boolean isCaseSensitive)

Creates a partitioned ParquetColumnarRowInputFormat whose partition columns can be generated from the split Path.
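A sketch of the partitioned variant, assuming a Hive-style directory layout; the partition key `dt`, the default-partition name, and the use of `PartitionFieldExtractor.forFileSystem` are illustrative assumptions, not prescribed by this page:

```java
import java.util.Collections;

import org.apache.flink.connector.file.src.FileSourceSplit;
import org.apache.flink.connector.file.table.PartitionFieldExtractor;
import org.apache.flink.formats.parquet.ParquetColumnarRowInputFormat;
import org.apache.flink.table.types.logical.BigIntType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.RowType;
import org.apache.flink.table.types.logical.VarCharType;
import org.apache.hadoop.conf.Configuration;

public class PartitionedParquetSketch {

    static ParquetColumnarRowInputFormat<FileSourceSplit> buildPartitionedFormat() {
        // The produced row type includes the partition column "dt", which is not
        // stored inside the Parquet files but recovered from split paths such as
        // .../dt=2023-01-01/part-0.parquet (hypothetical layout).
        RowType producedRowType = RowType.of(
                new LogicalType[] {new BigIntType(), new VarCharType(VarCharType.MAX_LENGTH)},
                new String[] {"id", "dt"});

        return ParquetColumnarRowInputFormat.createPartitionedFormat(
                new Configuration(),
                producedRowType,
                Collections.singletonList("dt"),      // partition keys taken from the path
                PartitionFieldExtractor.forFileSystem("__DEFAULT__"), // value for missing partitions
                500,   // batch size
                true,  // isUtcTimestamp
                true); // isCaseSensitive
    }
}
```

The factory picks the non-partition columns as the projected type to read from the files and fills the partition columns from values the extractor pulls out of each split's Path.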
Copyright © 2014–2023 The Apache Software Foundation. All rights reserved.