Package org.apache.flink.orc
Class AbstractOrcFileInputFormat<T,BatchT,SplitT extends FileSourceSplit>
- java.lang.Object
  - org.apache.flink.orc.AbstractOrcFileInputFormat<T,BatchT,SplitT>
- Type Parameters:
  T - The type of records produced by this reader format.
- All Implemented Interfaces:
  Serializable, ResultTypeQueryable<T>, BulkFormat<T,SplitT>
- Direct Known Subclasses:
  OrcColumnarRowInputFormat
public abstract class AbstractOrcFileInputFormat<T,BatchT,SplitT extends FileSourceSplit> extends Object implements BulkFormat<T,SplitT>
The base for ORC readers for the FileSource. Implements the reader initialization, vectorized reading, and pooling of column vector objects.

Subclasses implement the conversion to the specific result record(s) that they return by extending AbstractOrcFileInputFormat.OrcReaderBatch.

- See Also:
  Serialized Form
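To make the extension contract concrete, the following is a minimal sketch of a subclass. It assumes the default shim's batch type is the Hive VectorizedRowBatch (as used by OrcColumnarRowInputFormat), and the import paths follow the Flink 1.15-era package layout and may differ slightly across versions. MyOrcInputFormat and the record type MyRecord are hypothetical names introduced only for illustration; OrcColumnarRowInputFormat is the reference implementation shipped with Flink.

  import java.util.List;

  import org.apache.flink.api.common.typeinfo.TypeInformation;
  import org.apache.flink.connector.file.src.FileSourceSplit;
  import org.apache.flink.connector.file.src.util.Pool;
  import org.apache.flink.orc.AbstractOrcFileInputFormat;
  import org.apache.flink.orc.OrcFilters;
  import org.apache.flink.orc.shim.OrcShim;
  import org.apache.flink.orc.vector.OrcVectorizedBatchWrapper;
  import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
  import org.apache.orc.TypeDescription;

  // Hypothetical subclass; MyRecord stands for the application's record class (not defined here).
  public class MyOrcInputFormat
          extends AbstractOrcFileInputFormat<MyRecord, VectorizedRowBatch, FileSourceSplit> {

      public MyOrcInputFormat(
              org.apache.hadoop.conf.Configuration hadoopConfig,
              TypeDescription schema,
              int[] selectedFields,
              List<OrcFilters.Predicate> conjunctPredicates,
              int batchSize) {
          // OrcShim.defaultShim() targets the latest supported ORC version, as the
          // constructor documentation below recommends.
          super(OrcShim.defaultShim(), hadoopConfig, schema, selectedFields, conjunctPredicates, batchSize);
      }

      @Override
      public OrcReaderBatch<MyRecord, VectorizedRowBatch> createReaderBatch(
              FileSourceSplit split,
              OrcVectorizedBatchWrapper<VectorizedRowBatch> orcBatch,
              Pool.Recycler<OrcReaderBatch<MyRecord, VectorizedRowBatch>> recycler,
              int batchSize) {
          // Build and return a concrete OrcReaderBatch that converts the ORC column
          // vectors in 'orcBatch' into MyRecord instances (conversion not shown here).
          throw new UnsupportedOperationException("conversion batch omitted in this sketch");
      }

      @Override
      public TypeInformation<MyRecord> getProducedType() {
          return TypeInformation.of(MyRecord.class);
      }
  }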
Nested Class Summary
- protected static class AbstractOrcFileInputFormat.OrcReaderBatch<T,BatchT>
  The OrcReaderBatch class holds the data structures containing the batch data (column vectors, row arrays, ...) and performs the batch conversion from the ORC representation to the result format.
- protected static class AbstractOrcFileInputFormat.OrcVectorizedReader<T,BatchT>
  A vectorized ORC reader.
- Nested classes/interfaces inherited from interface org.apache.flink.connector.file.src.reader.BulkFormat:
  BulkFormat.Reader<T>, BulkFormat.RecordIterator<T>
Field Summary
- protected int batchSize
- protected List<OrcFilters.Predicate> conjunctPredicates
- protected SerializableHadoopConfigWrapper hadoopConfigWrapper
- protected org.apache.orc.TypeDescription schema
- protected int[] selectedFields
- protected OrcShim<BatchT> shim
Constructor Summary
- protected AbstractOrcFileInputFormat(OrcShim<BatchT> shim, org.apache.hadoop.conf.Configuration hadoopConfig, org.apache.orc.TypeDescription schema, int[] selectedFields, List<OrcFilters.Predicate> conjunctPredicates, int batchSize)
Method Summary
- AbstractOrcFileInputFormat.OrcVectorizedReader<T,BatchT> createReader(Configuration config, SplitT split)
  Creates a new reader that reads from the split's path, starting at the split's offset, and reads length bytes after the offset.
- abstract AbstractOrcFileInputFormat.OrcReaderBatch<T,BatchT> createReaderBatch(SplitT split, OrcVectorizedBatchWrapper<BatchT> orcBatch, Pool.Recycler<AbstractOrcFileInputFormat.OrcReaderBatch<T,BatchT>> recycler, int batchSize)
  Creates the AbstractOrcFileInputFormat.OrcReaderBatch structure, which holds the data structures for the batch data (column vectors, row arrays, ...) and performs the batch conversion from the ORC representation to the result format.
- abstract TypeInformation<T> getProducedType()
  Gets the type produced by this format.
- boolean isSplittable()
  Checks whether this format is splittable.
- AbstractOrcFileInputFormat.OrcVectorizedReader<T,BatchT> restoreReader(Configuration config, SplitT split)
  Creates a new reader that reads from split.path(), starting at offset, and reads until length bytes after the offset.
Field Detail
- hadoopConfigWrapper
  protected final SerializableHadoopConfigWrapper hadoopConfigWrapper
- schema
  protected final org.apache.orc.TypeDescription schema
- selectedFields
  protected final int[] selectedFields
- conjunctPredicates
  protected final List<OrcFilters.Predicate> conjunctPredicates
- batchSize
  protected final int batchSize
Constructor Detail
- AbstractOrcFileInputFormat
  protected AbstractOrcFileInputFormat(OrcShim<BatchT> shim, org.apache.hadoop.conf.Configuration hadoopConfig, org.apache.orc.TypeDescription schema, int[] selectedFields, List<OrcFilters.Predicate> conjunctPredicates, int batchSize)
  Parameters:
    shim - the shim for the various supported ORC versions. If you use the latest ORC version, use OrcShim.defaultShim() directly.
    hadoopConfig - the Hadoop configuration for the ORC reader.
    schema - the full schema of the ORC file.
    selectedFields - the indices of the fields to read from the ORC schema.
    conjunctPredicates - the filter predicates that can be evaluated.
    batchSize - the number of rows read per vectorized batch.
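As a hedged illustration of how these parameters are typically assembled, the sketch below builds the hypothetical MyOrcInputFormat subclass from the class overview above; the schema string, the selected field indices, and the batch size are made-up example values.

  import java.util.Collections;

  import org.apache.orc.TypeDescription;

  final class MyOrcFormatFactory {

      static MyOrcInputFormat createFormat() {
          // Hadoop configuration used to open the ORC file (e.g. filesystem settings).
          org.apache.hadoop.conf.Configuration hadoopConf = new org.apache.hadoop.conf.Configuration();

          // Full ORC schema of the file; the indices below project it down to 'id' and 'score'.
          TypeDescription schema =
                  TypeDescription.fromString("struct<id:bigint,name:string,score:double>");
          int[] selectedFields = {0, 2};

          return new MyOrcInputFormat(
                  hadoopConf,
                  schema,
                  selectedFields,
                  Collections.emptyList(), // no filter predicates pushed down
                  2048);                   // rows per vectorized batch
      }
  }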
Method Detail
- createReader
  public AbstractOrcFileInputFormat.OrcVectorizedReader<T,BatchT> createReader(Configuration config, SplitT split) throws IOException
  Description copied from interface: BulkFormat
  Creates a new reader that reads from the split's path, starting at the split's offset, and reads length bytes after the offset.
  Specified by:
    createReader in interface BulkFormat<T,SplitT>
  Throws:
    IOException
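To show how the returned reader is typically consumed, here is a hedged sketch of a read loop over one split, using the BulkFormat.Reader and BulkFormat.RecordIterator contracts. MyOrcInputFormat and MyRecord are the hypothetical names from the class overview, and the split is assumed to come from the file source's split enumerator.

  import java.io.IOException;

  import org.apache.flink.configuration.Configuration;
  import org.apache.flink.connector.file.src.FileSourceSplit;
  import org.apache.flink.connector.file.src.reader.BulkFormat;
  import org.apache.flink.connector.file.src.util.RecordAndPosition;

  final class OrcReadLoop {

      static void readSplit(MyOrcInputFormat format, FileSourceSplit split) throws IOException {
          BulkFormat.Reader<MyRecord> reader = format.createReader(new Configuration(), split);
          try {
              BulkFormat.RecordIterator<MyRecord> batch;
              // readBatch() returns null once the split is exhausted.
              while ((batch = reader.readBatch()) != null) {
                  RecordAndPosition<MyRecord> record;
                  while ((record = batch.next()) != null) {
                      process(record.getRecord());
                  }
                  // Return the batch (and its pooled column vectors) for reuse.
                  batch.releaseBatch();
              }
          } finally {
              reader.close();
          }
      }

      private static void process(MyRecord record) {
          // Application-specific handling of the converted record.
      }
  }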
- restoreReader
  public AbstractOrcFileInputFormat.OrcVectorizedReader<T,BatchT> restoreReader(Configuration config, SplitT split) throws IOException
  Description copied from interface: BulkFormat
  Creates a new reader that reads from split.path(), starting at offset, and reads until length bytes after the offset. A number of recordsToSkip records should be read and discarded after the offset. This is typically part of restoring a reader to a checkpointed position.
  Specified by:
    restoreReader in interface BulkFormat<T,SplitT>
  Throws:
    IOException
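Which of the two factory methods to call usually depends on whether the split carries a checkpointed reader position. The sketch below assumes FileSourceSplit exposes that position via getReaderPosition() (as in recent Flink releases) and reuses the hypothetical names from the sketches above.

  import java.io.IOException;

  import org.apache.flink.configuration.Configuration;
  import org.apache.flink.connector.file.src.FileSourceSplit;
  import org.apache.flink.connector.file.src.reader.BulkFormat;

  final class ReaderDispatch {

      static BulkFormat.Reader<MyRecord> open(
              MyOrcInputFormat format, Configuration config, FileSourceSplit split)
              throws IOException {
          // A split restored from a checkpoint carries the position (offset plus records
          // to skip) at which reading must resume; freshly enumerated splits do not.
          return split.getReaderPosition().isPresent()
                  ? format.restoreReader(config, split)
                  : format.createReader(config, split);
      }
  }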
- isSplittable
  public boolean isSplittable()
  Description copied from interface: BulkFormat
  Checks whether this format is splittable. Splittable formats allow Flink to create multiple splits per file, so that Flink can read multiple regions of the file concurrently. See the top-level JavaDocs of BulkFormat (section "Splitting") for details.
  Specified by:
    isSplittable in interface BulkFormat<T,SplitT>
- createReaderBatch
  public abstract AbstractOrcFileInputFormat.OrcReaderBatch<T,BatchT> createReaderBatch(SplitT split, OrcVectorizedBatchWrapper<BatchT> orcBatch, Pool.Recycler<AbstractOrcFileInputFormat.OrcReaderBatch<T,BatchT>> recycler, int batchSize)
  Creates the AbstractOrcFileInputFormat.OrcReaderBatch structure, which holds the data structures for the batch data (column vectors, row arrays, ...) and performs the batch conversion from the ORC representation to the result format.
- getProducedType
  public abstract TypeInformation<T> getProducedType()
  Gets the type produced by this format.
  Specified by:
    getProducedType in interface BulkFormat<T,SplitT>
  Specified by:
    getProducedType in interface ResultTypeQueryable<T>
  Returns:
    The data type produced by this function or input format.
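As a brief, hedged illustration: an implementation typically returns a TypeInformation matching its record type, either via the Class-based factory or via a TypeHint when the produced type is generic. MyRecord is the hypothetical record type used in the sketches above.

  import java.util.Map;

  import org.apache.flink.api.common.typeinfo.TypeHint;
  import org.apache.flink.api.common.typeinfo.TypeInformation;

  final class ProducedTypeExamples {

      // Non-generic record classes can use the Class-based factory.
      static TypeInformation<MyRecord> simple() {
          return TypeInformation.of(MyRecord.class);
      }

      // Generic produced types need a TypeHint so the full type arguments are preserved.
      static TypeInformation<Map<String, Long>> generic() {
          return TypeInformation.of(new TypeHint<Map<String, Long>>() {});
      }
  }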