public class OrcColumnarRowInputFormat<BatchT,SplitT extends FileSourceSplit> extends AbstractOrcFileInputFormat<RowData,BatchT,SplitT> implements FileBasedStatisticsReportableInputFormat
An ORC reader that produces a stream of ColumnarRowData records. This class can add extra fields through ColumnBatchFactory, for example partition fields, which can be extracted from the file path. Therefore, the getProducedType() may be different and the types of the extra fields need to be added.
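For orientation, the sketch below shows one way such a format could be plugged into a FileSource to read RowData records. The path, the class and method names of the sketch itself, and the assumption that a fully configured format instance already exists (for example via createPartitionedFormat, see below) are illustrative only, not part of this class's contract.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.FileSourceSplit;
import org.apache.flink.core.fs.Path;
import org.apache.flink.orc.OrcColumnarRowInputFormat;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;

public class OrcReadSketch {
    // The format is assumed to be built elsewhere (e.g. via createPartitionedFormat below).
    static DataStream<RowData> readOrc(
            StreamExecutionEnvironment env,
            OrcColumnarRowInputFormat<?, FileSourceSplit> format) {
        FileSource<RowData> source = FileSource
                .forBulkFileFormat(format, new Path("/data/warehouse/my_table")) // made-up path
                .build();
        // Bounded batch-style read; no event-time watermarks needed.
        return env.fromSource(source, WatermarkStrategy.noWatermarks(), "orc-source");
    }
}
```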
Nested classes/interfaces inherited from class AbstractOrcFileInputFormat: AbstractOrcFileInputFormat.OrcReaderBatch<T,BatchT>, AbstractOrcFileInputFormat.OrcVectorizedReader<T,BatchT>
Nested classes/interfaces inherited from interface BulkFormat: BulkFormat.Reader<T>, BulkFormat.RecordIterator<T>
Fields inherited from class AbstractOrcFileInputFormat: batchSize, conjunctPredicates, hadoopConfigWrapper, schema, selectedFields, shim
| Constructor and Description |
| --- |
| OrcColumnarRowInputFormat(OrcShim<BatchT> shim, org.apache.hadoop.conf.Configuration hadoopConfig, org.apache.orc.TypeDescription schema, int[] selectedFields, List<OrcFilters.Predicate> conjunctPredicates, int batchSize, ColumnBatchFactory<BatchT,SplitT> batchFactory, TypeInformation<RowData> producedTypeInfo) |
| Modifier and Type | Method and Description |
| --- | --- |
| static <SplitT extends FileSourceSplit> OrcColumnarRowInputFormat<org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch,SplitT> | createPartitionedFormat(OrcShim<org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch> shim, org.apache.hadoop.conf.Configuration hadoopConfig, RowType tableType, List<String> partitionKeys, PartitionFieldExtractor<SplitT> extractor, int[] selectedFields, List<OrcFilters.Predicate> conjunctPredicates, int batchSize, Function<RowType,TypeInformation<RowData>> rowTypeInfoFactory) — Creates a partitioned OrcColumnarRowInputFormat; the partition columns can be generated from the split. |
| AbstractOrcFileInputFormat.OrcReaderBatch<RowData,BatchT> | createReaderBatch(SplitT split, OrcVectorizedBatchWrapper<BatchT> orcBatch, Pool.Recycler<AbstractOrcFileInputFormat.OrcReaderBatch<RowData,BatchT>> recycler, int batchSize) — Creates the AbstractOrcFileInputFormat.OrcReaderBatch structure, which is responsible for holding the data structures that hold the batch data (column vectors, row arrays, ...) and the batch conversion from the ORC representation to the result format. |
| TypeInformation<RowData> | getProducedType() — Gets the type produced by this format. |
| TableStats | reportStatistics(List<Path> files, DataType producedDataType) — Returns the estimated statistics of this input format. |
Methods inherited from class AbstractOrcFileInputFormat: createReader, isSplittable, restoreReader
public OrcColumnarRowInputFormat(OrcShim<BatchT> shim, org.apache.hadoop.conf.Configuration hadoopConfig, org.apache.orc.TypeDescription schema, int[] selectedFields, List<OrcFilters.Predicate> conjunctPredicates, int batchSize, ColumnBatchFactory<BatchT,SplitT> batchFactory, TypeInformation<RowData> producedTypeInfo)
public AbstractOrcFileInputFormat.OrcReaderBatch<RowData,BatchT> createReaderBatch(SplitT split, OrcVectorizedBatchWrapper<BatchT> orcBatch, Pool.Recycler<AbstractOrcFileInputFormat.OrcReaderBatch<RowData,BatchT>> recycler, int batchSize)
Description copied from class: AbstractOrcFileInputFormat
Creates the AbstractOrcFileInputFormat.OrcReaderBatch structure, which is responsible for holding the data structures that hold the batch data (column vectors, row arrays, ...) and the batch conversion from the ORC representation to the result format.
Specified by: createReaderBatch in class AbstractOrcFileInputFormat<RowData,BatchT,SplitT extends FileSourceSplit>
public TypeInformation<RowData> getProducedType()
Description copied from class: AbstractOrcFileInputFormat
Gets the type produced by this format.
Specified by: getProducedType in interface ResultTypeQueryable<RowData>
Specified by: getProducedType in interface BulkFormat<RowData,SplitT extends FileSourceSplit>
Specified by: getProducedType in class AbstractOrcFileInputFormat<RowData,BatchT,SplitT extends FileSourceSplit>
public TableStats reportStatistics(List<Path> files, DataType producedDataType)
Description copied from interface: FileBasedStatisticsReportableInputFormat
Returns the estimated statistics of this input format.
Specified by: reportStatistics in interface FileBasedStatisticsReportableInputFormat
Parameters:
files - The files to be estimated.
producedDataType - the final output type of the format.
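As a hedged illustration of this method, the snippet below estimates statistics for a fixed set of ORC files. The file path, field names, and the OrcStatsSketch wrapper are made up for the example; only the reportStatistics signature comes from this class.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.flink.connector.file.src.FileSourceSplit;
import org.apache.flink.core.fs.Path;
import org.apache.flink.orc.OrcColumnarRowInputFormat;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.plan.stats.TableStats;
import org.apache.flink.table.types.DataType;

public class OrcStatsSketch {
    // Hypothetical helper: estimate statistics for a fixed set of ORC files.
    static TableStats estimate(OrcColumnarRowInputFormat<?, FileSourceSplit> format) {
        List<Path> files = Arrays.asList(
                new Path("/data/warehouse/my_table/dt=2024-01-01/part-0.orc")); // made-up path
        // The final output type of the format, including the partition column.
        DataType producedDataType = DataTypes.ROW(
                DataTypes.FIELD("id", DataTypes.BIGINT()),
                DataTypes.FIELD("name", DataTypes.STRING()),
                DataTypes.FIELD("dt", DataTypes.STRING()));
        return format.reportStatistics(files, producedDataType);
    }
}
```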
public static <SplitT extends FileSourceSplit> OrcColumnarRowInputFormat<org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch,SplitT> createPartitionedFormat(OrcShim<org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch> shim, org.apache.hadoop.conf.Configuration hadoopConfig, RowType tableType, List<String> partitionKeys, PartitionFieldExtractor<SplitT> extractor, int[] selectedFields, List<OrcFilters.Predicate> conjunctPredicates, int batchSize, Function<RowType,TypeInformation<RowData>> rowTypeInfoFactory)
Creates a partitioned OrcColumnarRowInputFormat; the partition columns can be generated from the split.
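A minimal sketch of calling createPartitionedFormat for a table with one partition column. The schema, the default partition value, and the PartitionFieldExtractor.forFileSystem helper (whose package and availability depend on the Flink version) are assumptions; InternalTypeInfo::of is used here as one possible rowTypeInfoFactory.

```java
import java.util.Collections;

import org.apache.flink.connector.file.src.FileSourceSplit;
import org.apache.flink.connector.file.table.PartitionFieldExtractor;
import org.apache.flink.orc.OrcColumnarRowInputFormat;
import org.apache.flink.orc.shim.OrcShim;
import org.apache.flink.table.runtime.typeutils.InternalTypeInfo;
import org.apache.flink.table.types.logical.BigIntType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.RowType;
import org.apache.flink.table.types.logical.VarCharType;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;

public class PartitionedOrcFormatSketch {
    static OrcColumnarRowInputFormat<VectorizedRowBatch, FileSourceSplit> build() {
        // Full table schema: two physical columns plus the "dt" partition column (made up).
        RowType tableType = RowType.of(
                new LogicalType[] {
                        new BigIntType(),
                        new VarCharType(VarCharType.MAX_LENGTH),
                        new VarCharType(VarCharType.MAX_LENGTH)},
                new String[] {"id", "name", "dt"});

        return OrcColumnarRowInputFormat.createPartitionedFormat(
                OrcShim.defaultShim(),                       // default ORC shim
                new Configuration(),                         // Hadoop configuration
                tableType,
                Collections.singletonList("dt"),             // partition keys
                PartitionFieldExtractor.forFileSystem("__DEFAULT_PARTITION__"), // assumed helper
                new int[] {0, 1, 2},                         // select all columns, incl. "dt"
                Collections.emptyList(),                     // no pushed-down ORC predicates
                2048,                                        // rows per vectorized batch
                InternalTypeInfo::of);                       // produced type factory
    }
}
```

The resulting format can then be handed to a FileSource as shown in the sketch near the top of this page; the extra "dt" field is filled per split rather than read from the ORC files.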