E
- The type of record to read.public abstract class ParquetInputFormat<E> extends FileInputFormat<E> implements CheckpointableInputFormat<FileInputSplit,Tuple2<Long,Long>>
convert(Row)
method need to be implemented.
Using ParquetRecordReader
to read files instead of FSDataInputStream
, we override open(FileInputSplit)
and close()
to change the behaviors.
FileInputFormat.FileBaseStatistics, FileInputFormat.InputSplitOpenThread
Modifier and Type | Field and Description |
---|---|
static String |
PARQUET_SKIP_CORRUPTED_RECORD
The config parameter which defines whether to skip corrupted record.
|
static String |
PARQUET_SKIP_WRONG_SCHEMA_SPLITS
The config parameter which defines whether to skip file split with wrong schema.
|
currentSplit, ENUMERATE_NESTED_FILES_FLAG, enumerateNestedFiles, filePath, INFLATER_INPUT_STREAM_FACTORIES, minSplitSize, numSplits, openTimeout, READ_WHOLE_SPLIT_FLAG, splitLength, splitStart, stream, unsplittable
Modifier | Constructor and Description |
---|---|
protected |
ParquetInputFormat(Path path,
org.apache.parquet.schema.MessageType messageType)
Read parquet files with given parquet file schema.
|
Modifier and Type | Method and Description |
---|---|
void |
close()
Closes the file input stream of the input format.
|
void |
configure(Configuration parameters)
Configures the file input format by reading the file path from the configuration.
|
protected abstract E |
convert(Row row)
This ParquetInputFormat read parquet record as Row by default.
|
Tuple2<Long,Long> |
getCurrentState()
Returns the split currently being read, along with its current state.
|
protected String[] |
getFieldNames()
Get field names of read result.
|
protected TypeInformation[] |
getFieldTypes()
Get field types of read result.
|
protected org.apache.parquet.filter2.predicate.FilterPredicate |
getPredicate() |
E |
nextRecord(E e)
Reads the next record from the input.
|
void |
open(FileInputSplit split)
Opens an input stream to the file defined in the input format.
|
boolean |
reachedEnd()
Method used to check if the end of the input is reached.
|
void |
reopen(FileInputSplit split,
Tuple2<Long,Long> state)
Restores the state of a parallel instance reading from an
InputFormat . |
void |
selectFields(String[] fieldNames)
Configures the fields to be read and returned by the ParquetInputFormat.
|
void |
setFilterPredicate(org.apache.parquet.filter2.predicate.FilterPredicate filterPredicate) |
acceptFile, createInputSplits, decorateInputStream, extractFileExtension, getFilePath, getFilePaths, getFileStats, getFileStats, getInflaterInputStreamFactory, getInputSplitAssigner, getMinSplitSize, getNestedFileEnumeration, getNumSplits, getOpenTimeout, getSplitLength, getSplitStart, getStatistics, registerInflaterInputStreamFactory, setFilePath, setFilePath, setFilePaths, setFilePaths, setFilesFilter, setMinSplitSize, setNestedFileEnumeration, setNumSplits, setOpenTimeout, supportsMultiPaths, testForUnsplittable, toString
closeInputFormat, getRuntimeContext, openInputFormat, setRuntimeContext
public static final String PARQUET_SKIP_WRONG_SCHEMA_SPLITS
public static final String PARQUET_SKIP_CORRUPTED_RECORD
protected ParquetInputFormat(Path path, org.apache.parquet.schema.MessageType messageType)
path
- The path of the file to read.messageType
- schema of parquet filepublic void configure(Configuration parameters)
FileInputFormat
configure
in interface InputFormat<E,FileInputSplit>
configure
in class FileInputFormat<E>
parameters
- The configuration with all parameters (note: not the Flink config but the
TaskConfig).InputFormat.configure(org.apache.flink.configuration.Configuration)
public void selectFields(String[] fieldNames)
fieldNames
- Names of all selected fields.public void setFilterPredicate(org.apache.parquet.filter2.predicate.FilterPredicate filterPredicate)
public Tuple2<Long,Long> getCurrentState()
CheckpointableInputFormat
getCurrentState
in interface CheckpointableInputFormat<FileInputSplit,Tuple2<Long,Long>>
public void open(FileInputSplit split) throws IOException
FileInputFormat
The stream is actually opened in an asynchronous thread to make sure any interruptions to the thread working on the input format do not reach the file system.
open
in interface InputFormat<E,FileInputSplit>
open
in class FileInputFormat<E>
split
- The split to be opened.IOException
- Thrown, if the spit could not be opened due to an I/O problem.public void reopen(FileInputSplit split, Tuple2<Long,Long> state) throws IOException
CheckpointableInputFormat
InputFormat
. This is
necessary when recovering from a task failure. When this method is called, the input format
it guaranteed to be configured.
NOTE: The caller has to make sure that the provided split is the one to whom the state belongs.
reopen
in interface CheckpointableInputFormat<FileInputSplit,Tuple2<Long,Long>>
split
- The split to be opened.state
- The state from which to start from. This can contain the offset, but also other
data, depending on the input format.IOException
protected String[] getFieldNames()
protected TypeInformation[] getFieldTypes()
@VisibleForTesting protected org.apache.parquet.filter2.predicate.FilterPredicate getPredicate()
public void close() throws IOException
FileInputFormat
close
in interface InputFormat<E,FileInputSplit>
close
in class FileInputFormat<E>
IOException
- Thrown, if the input could not be closed properly.public boolean reachedEnd() throws IOException
InputFormat
When this method is called, the input format it guaranteed to be opened.
reachedEnd
in interface InputFormat<E,FileInputSplit>
IOException
- Thrown, if an I/O error occurred.public E nextRecord(E e) throws IOException
InputFormat
When this method is called, the input format it guaranteed to be opened.
nextRecord
in interface InputFormat<E,FileInputSplit>
e
- Object that may be reused.IOException
- Thrown, if an I/O error occurred.Copyright © 2014–2021 The Apache Software Foundation. All rights reserved.