T - The type of records produced by this reader format.
public abstract class AbstractOrcFileInputFormat<T,BatchT,SplitT extends FileSourceSplit> extends Object implements BulkFormat<T,SplitT>
The base for ORC readers for the FileSource. Implements the reader initialization, vectorized reading, and pooling of column vector objects.
Subclasses implement the conversion to the specific result record(s) that they return by extending AbstractOrcFileInputFormat.OrcReaderBatch.
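The pooling mentioned above can be pictured with a small stand-alone sketch. It does not use Flink's classes: `BatchPool` and `ReusableBatch` are hypothetical stand-ins for the pool and for an OrcReaderBatch subclass, and the `long[]` array stands in for a column vector. The only idea shown is that a batch is handed out, filled, and returned to the pool through a recycler callback instead of being reallocated.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.function.Consumer;

// Hypothetical stand-in for a reusable batch; not Flink's OrcReaderBatch.
final class ReusableBatch {
    final long[] values; // stands in for a column vector
    int size;            // number of valid rows currently held
    private final Consumer<ReusableBatch> recycler;

    ReusableBatch(int capacity, Consumer<ReusableBatch> recycler) {
        this.values = new long[capacity];
        this.recycler = recycler;
    }

    // Resets the batch and hands it back to its owner.
    void recycle() {
        size = 0;
        recycler.accept(this);
    }
}

// Hypothetical fixed-size pool; not Flink's Pool class.
final class BatchPool {
    private final ArrayBlockingQueue<ReusableBatch> free;

    BatchPool(int poolSize, int batchSize) {
        free = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            // Each batch puts itself back into the queue when recycled.
            free.add(new ReusableBatch(batchSize, free::add));
        }
    }

    // Returns a free batch, or null if all batches are in use.
    ReusableBatch poll() {
        return free.poll();
    }
}
```

With a pool of size one, a second `poll()` returns null until the first batch has been recycled, which is what bounds the number of column vector objects alive at any time.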
Modifier and Type | Class and Description |
---|---|
protected static class | AbstractOrcFileInputFormat.OrcReaderBatch<T,BatchT>: The OrcReaderBatch class holds the data structures containing the batch data (column vectors, row arrays, ...) and performs the batch conversion from the ORC representation to the result format. |
protected static class | AbstractOrcFileInputFormat.OrcVectorizedReader<T,BatchT>: A vectorized ORC reader. |
Nested classes/interfaces inherited from interface BulkFormat: BulkFormat.Reader<T>, BulkFormat.RecordIterator<T>
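As a rough illustration of the batch conversion that OrcReaderBatch is described as performing, the sketch below turns a columnar batch into row objects. It is not Flink or ORC code: `ColumnToRowSketch`, the two arrays standing in for column vectors, and the `Object[]` row format are all assumptions chosen for the example.

```java
// Hypothetical sketch; plain arrays stand in for ORC column vectors.
final class ColumnToRowSketch {
    // Converts the first `size` entries of two columns into row arrays.
    static Object[][] toRows(long[] idColumn, String[] nameColumn, int size) {
        Object[][] rows = new Object[size][];
        for (int r = 0; r < size; r++) {
            // One row gathers the r-th value of every column.
            rows[r] = new Object[] {idColumn[r], nameColumn[r]};
        }
        return rows;
    }
}
```

The `size` argument matters because a column vector is typically allocated at full batch capacity while only a prefix of it holds valid rows.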
Modifier and Type | Field and Description |
---|---|
protected int | batchSize |
protected List<OrcFilters.Predicate> | conjunctPredicates |
protected SerializableHadoopConfigWrapper | hadoopConfigWrapper |
protected org.apache.orc.TypeDescription | schema |
protected int[] | selectedFields |
protected OrcShim<BatchT> | shim |
Modifier | Constructor and Description |
---|---|
protected | AbstractOrcFileInputFormat(OrcShim<BatchT> shim, org.apache.hadoop.conf.Configuration hadoopConfig, org.apache.orc.TypeDescription schema, int[] selectedFields, List<OrcFilters.Predicate> conjunctPredicates, int batchSize) |
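The selectedFields argument of the constructor can be read as a projection over schema. The sketch below assumes, without confirmation from this page, that each entry is an index into the top-level fields of the full schema. `ProjectionSketch` and its field-name array are hypothetical and only illustrate that reading.

```java
// Hypothetical sketch; assumes selectedFields holds top-level field indices.
final class ProjectionSketch {
    // Returns the names of the selected fields, in the order they were selected.
    static String[] project(String[] fieldNames, int[] selectedFields) {
        String[] projected = new String[selectedFields.length];
        for (int i = 0; i < selectedFields.length; i++) {
            projected[i] = fieldNames[selectedFields[i]];
        }
        return projected;
    }
}
```

Under this assumption, selecting `{2, 0}` from a schema with fields `id`, `name`, `price` yields `price` followed by `id`, so the index array controls both which columns are read and their output order.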
Modifier and Type | Method and Description |
---|---|
AbstractOrcFileInputFormat.OrcVectorizedReader<T,BatchT> | createReader(Configuration config, SplitT split): Creates a new reader that reads from the split's path starting at the split's offset and reads length bytes after the offset. |
abstract AbstractOrcFileInputFormat.OrcReaderBatch<T,BatchT> | createReaderBatch(SplitT split, OrcVectorizedBatchWrapper<BatchT> orcBatch, Pool.Recycler<AbstractOrcFileInputFormat.OrcReaderBatch<T,BatchT>> recycler, int batchSize): Creates the AbstractOrcFileInputFormat.OrcReaderBatch structure, which is responsible for holding the data structures that hold the batch data (column vectors, row arrays, ...) and the batch conversion from the ORC representation to the result format. |
abstract TypeInformation<T> | getProducedType(): Gets the type produced by this format. |
boolean | isSplittable(): Checks whether this format is splittable. |
AbstractOrcFileInputFormat.OrcVectorizedReader<T,BatchT> | restoreReader(Configuration config, SplitT split): Creates a new reader that reads from split.path() starting at offset and reads until length bytes after the offset. |
protected final SerializableHadoopConfigWrapper hadoopConfigWrapper
protected final org.apache.orc.TypeDescription schema
protected final int[] selectedFields
protected final List<OrcFilters.Predicate> conjunctPredicates
protected final int batchSize
protected AbstractOrcFileInputFormat(OrcShim<BatchT> shim, org.apache.hadoop.conf.Configuration hadoopConfig, org.apache.orc.TypeDescription schema, int[] selectedFields, List<OrcFilters.Predicate> conjunctPredicates, int batchSize)
shim - the shim for the various ORC dependency versions. If you use the latest version, use OrcShim.defaultShim() directly.
hadoopConfig - the Hadoop config for the ORC reader.
schema - the full schema of the ORC format.
selectedFields - the selected fields to read from the ORC format.
conjunctPredicates - the filter predicates that can be evaluated.
batchSize - the batch size of the ORC reader.
public AbstractOrcFileInputFormat.OrcVectorizedReader<T,BatchT> createReader(Configuration config, SplitT split) throws IOException
Description copied from interface: BulkFormat
Creates a new reader that reads from the split's path starting at the split's offset and reads length bytes after the offset.
Specified by: createReader in interface BulkFormat<T,SplitT extends FileSourceSplit>
Throws: IOException
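ORC files are organized in stripes, so a byte range given by a split usually maps to whole stripes. The sketch below assumes that a stripe belongs to the split whose range [offset, offset + length) contains the stripe's start offset. That rule is an assumption made for illustration and is not stated on this page; `SplitRangeSketch` and `stripeStarts` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the ownership rule is an assumption, not taken from this page.
final class SplitRangeSketch {
    // Returns the indices of stripes whose start offset lies in [offset, offset + length).
    static List<Integer> stripesInSplit(long[] stripeStarts, long offset, long length) {
        long end = offset + length;
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < stripeStarts.length; i++) {
            if (stripeStarts[i] >= offset && stripeStarts[i] < end) {
                selected.add(i);
            }
        }
        return selected;
    }
}
```

Because each stripe start falls into exactly one of a set of adjacent, non-overlapping ranges, this rule would let several readers work on one file without reading any stripe twice.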
public AbstractOrcFileInputFormat.OrcVectorizedReader<T,BatchT> restoreReader(Configuration config, SplitT split) throws IOException
Description copied from interface: BulkFormat
Creates a new reader that reads from split.path() starting at offset and reads until length bytes after the offset. A number of recordsToSkip records should be read and discarded after the offset. This is typically part of restoring a reader to a checkpointed position.
Specified by: restoreReader in interface BulkFormat<T,SplitT extends FileSourceSplit>
Throws: IOException
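The skip-and-discard behavior described for restoreReader can be sketched with a plain iterator. `RestoreSketch` is a hypothetical helper, not part of Flink; the iterator stands in for a reader that is already positioned at the split's offset.

```java
import java.util.Iterator;

// Hypothetical sketch; a plain Iterator stands in for a positioned reader.
final class RestoreSketch {
    // Reads and discards up to recordsToSkip records, then returns the same iterator.
    static <T> Iterator<T> restore(Iterator<T> fromOffset, long recordsToSkip) {
        for (long i = 0; i < recordsToSkip && fromOffset.hasNext(); i++) {
            fromOffset.next();
        }
        return fromOffset;
    }
}
```

The records are discarded rather than seeked past because the checkpointed position is expressed as a record count after the offset, not as a byte position.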
public boolean isSplittable()
Description copied from interface: BulkFormat
Checks whether this format is splittable. See top-level JavaDocs (section "Splitting") for details.
Specified by: isSplittable in interface BulkFormat<T,SplitT extends FileSourceSplit>
public abstract AbstractOrcFileInputFormat.OrcReaderBatch<T,BatchT> createReaderBatch(SplitT split, OrcVectorizedBatchWrapper<BatchT> orcBatch, Pool.Recycler<AbstractOrcFileInputFormat.OrcReaderBatch<T,BatchT>> recycler, int batchSize)
Creates the AbstractOrcFileInputFormat.OrcReaderBatch structure, which holds the batch data (column vectors, row arrays, ...) and performs the batch conversion from the ORC representation to the result format.
public abstract TypeInformation<T> getProducedType()
Specified by: getProducedType in interface ResultTypeQueryable<T>
Specified by: getProducedType in interface BulkFormat<T,SplitT extends FileSourceSplit>
Copyright © 2014–2024 The Apache Software Foundation. All rights reserved.