ParquetInputFormat (flink 1.11-SNAPSHOT API)

- java.lang.Object
  - org.apache.flink.api.common.io.RichInputFormat<OT,FileInputSplit>
    - org.apache.flink.api.common.io.FileInputFormat<E>
      - org.apache.flink.formats.parquet.ParquetInputFormat<E>

Type Parameters:

E - The type of record to read.

All Implemented Interfaces:

Serializable, CheckpointableInputFormat<FileInputSplit,Tuple2<Long,Long>>, InputFormat<E,FileInputSplit>, InputSplitSource<FileInputSplit>

Direct Known Subclasses:

ParquetMapInputFormat, ParquetPojoInputFormat, ParquetRowInputFormat
```
public abstract class ParquetInputFormat<E>
extends FileInputFormat<E>
implements CheckpointableInputFormat<FileInputSplit,Tuple2<Long,Long>>
```
The base InputFormat class for reading Parquet files. To produce a specific return type, subclasses need to implement the convert(Row) method.
Because this class reads files with ParquetRecordReader instead of FSDataInputStream, it overrides open(FileInputSplit) and close() to change their behavior.

See Also:

Serialized Form

Nested Class Summary
- Nested classes/interfaces inherited from class org.apache.flink.api.common.io.FileInputFormat
  FileInputFormat.FileBaseStatistics, FileInputFormat.InputSplitOpenThread

Field Summary

Fields
Modifier and Type	Field and Description
`static String`	`PARQUET_SKIP_CORRUPTED_RECORD` The config parameter that defines whether to skip corrupted records.
`static String`	`PARQUET_SKIP_WRONG_SCHEMA_SPLITS` The config parameter that defines whether to skip file splits with a wrong schema.

Fields inherited from class org.apache.flink.api.common.io.FileInputFormat
currentSplit, ENUMERATE_NESTED_FILES_FLAG, enumerateNestedFiles, filePath, INFLATER_INPUT_STREAM_FACTORIES, minSplitSize, numSplits, openTimeout, READ_WHOLE_SPLIT_FLAG, splitLength, splitStart, stream, unsplittable

Constructor Summary

Constructors
Modifier	Constructor and Description
`protected`	`ParquetInputFormat(Path path, org.apache.parquet.schema.MessageType messageType)` Reads Parquet files with the given Parquet file schema.

Method Summary

All Methods Instance Methods Abstract Methods Concrete Methods
Modifier and Type	Method and Description
`void`	`close()` Closes the file input stream of the input format.
`void`	`configure(Configuration parameters)` Configures the file input format by reading the file path from the configuration.
`protected abstract E`	`convert(Row row)` This ParquetInputFormat reads Parquet records as Row by default.
`Tuple2<Long,Long>`	`getCurrentState()` Returns the split currently being read, along with its current state.
`protected String[]`	`getFieldNames()` Gets the field names of the read result.
`protected TypeInformation[]`	`getFieldTypes()` Gets the field types of the read result.
`protected org.apache.parquet.filter2.predicate.FilterPredicate`	`getPredicate()`
`E`	`nextRecord(E e)` Reads the next record from the input.
`void`	`open(FileInputSplit split)` Opens an input stream to the file defined in the input format.
`boolean`	`reachedEnd()` Method used to check if the end of the input is reached.
`void`	`reopen(FileInputSplit split, Tuple2<Long,Long> state)` Restores the state of a parallel instance reading from an `InputFormat`.
`void`	`selectFields(String[] fieldNames)` Configures the fields to be read and returned by the ParquetInputFormat.
`void`	`setFilterPredicate(org.apache.parquet.filter2.predicate.FilterPredicate filterPredicate)`

Methods inherited from class org.apache.flink.api.common.io.RichInputFormat
closeInputFormat, getRuntimeContext, openInputFormat, setRuntimeContext

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

- Field Detail
  - PARQUET_SKIP_WRONG_SCHEMA_SPLITS
```
public static final String PARQUET_SKIP_WRONG_SCHEMA_SPLITS
```
    The config parameter that defines whether to skip file splits with a wrong schema.
    
    See Also:
    
    Constant Field Values
  - PARQUET_SKIP_CORRUPTED_RECORD
```
public static final String PARQUET_SKIP_CORRUPTED_RECORD
```
    The config parameter that defines whether to skip corrupted records.
    
    See Also:
    
    Constant Field Values
- Constructor Detail
  - ParquetInputFormat
```
protected ParquetInputFormat(Path path,
                             org.apache.parquet.schema.MessageType messageType)
```
    Reads Parquet files with the given Parquet file schema.
    
    Parameters:
    
    path - The path of the file to read.
    
    messageType - The schema of the Parquet file.
- Method Detail
  - configure
```
public void configure(Configuration parameters)
```
    Description copied from class: FileInputFormat
    
    Configures the file input format by reading the file path from the configuration.
    
    Specified by:
    
    configure in interface InputFormat<E,FileInputSplit>
    
    Overrides:
    
    configure in class FileInputFormat<E>
    
    Parameters:
    
    parameters - The configuration with all parameters (note: not the Flink config but the TaskConfig).
    
    See Also:
    
    InputFormat.configure(org.apache.flink.configuration.Configuration)
  - selectFields
```
public void selectFields(String[] fieldNames)
```
    Configures the fields to be read and returned by the ParquetInputFormat. Selected fields must be present in the configured schema.
    
    Parameters:
    
    fieldNames - Names of all selected fields.
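
    As a rough illustration of the constraint above, the following sketch resolves selected field names against a schema and rejects names that the schema does not contain. `FieldSelectionSketch` and `selectIndices` are hypothetical, JDK-only stand-ins introduced for this example; they are not part of Flink and are not meant to describe how ParquetInputFormat implements field selection internally.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, JDK-only stand-in; not the Flink implementation.
final class FieldSelectionSketch {

    // Returns the schema position of every selected field, in selection order.
    // Throws if a selected field is missing from the schema.
    static int[] selectIndices(String[] schemaFields, String[] selectedFields) {
        List<String> schema = Arrays.asList(schemaFields);
        int[] indices = new int[selectedFields.length];
        for (int i = 0; i < selectedFields.length; i++) {
            int pos = schema.indexOf(selectedFields[i]);
            if (pos < 0) {
                throw new IllegalArgumentException(
                        "Field '" + selectedFields[i] + "' is not part of the schema");
            }
            indices[i] = pos;
        }
        return indices;
    }

    public static void main(String[] args) {
        String[] schema = {"id", "name", "price"};
        int[] picked = selectIndices(schema, new String[] {"price", "id"});
        System.out.println(Arrays.toString(picked)); // prints [2, 0]
    }
}
```

    Failing fast on an unknown name mirrors the stated requirement that selected fields must be present in the configured schema; how the real class reports such a mismatch is not specified on this page.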
  - setFilterPredicate
```
public void setFilterPredicate(org.apache.parquet.filter2.predicate.FilterPredicate filterPredicate)
```
  - getCurrentState
```
public Tuple2<Long,Long> getCurrentState()
```
    Description copied from interface: CheckpointableInputFormat
    
    Returns the split currently being read, along with its current state. This will be used to restore the state of the reading channel when recovering from a task failure. In the case of a simple text file, the state can correspond to the last read offset in the split.
    
    Specified by:
    
    getCurrentState in interface CheckpointableInputFormat<FileInputSplit,Tuple2<Long,Long>>
    
    Returns:
    
    The state of the channel.
  - open
```
public void open(FileInputSplit split)
          throws IOException
```
    Description copied from class: FileInputFormat
    
    Opens an input stream to the file defined in the input format. The stream is positioned at the beginning of the given split.
    The stream is actually opened in an asynchronous thread to make sure any interruptions to the thread working on the input format do not reach the file system.
    
    Specified by:
    
    open in interface InputFormat<E,FileInputSplit>
    
    Overrides:
    
    open in class FileInputFormat<E>
    
    Parameters:
    
    split - The split to be opened.
    
    Throws:
    
    IOException - Thrown if the split could not be opened due to an I/O problem.
  - reopen
```
public void reopen(FileInputSplit split,
                   Tuple2<Long,Long> state)
            throws IOException
```
    Description copied from interface: CheckpointableInputFormat
    
    Restores the state of a parallel instance reading from an InputFormat. This is necessary when recovering from a task failure. When this method is called, the input format is guaranteed to be configured.
    NOTE: The caller has to make sure that the provided split is the one to whom the state belongs.
    
    Specified by:
    
    reopen in interface CheckpointableInputFormat<FileInputSplit,Tuple2<Long,Long>>
    
    Parameters:
    
    split - The split to be opened.
    
    state - The state from which to start. This can contain the offset, but also other data, depending on the input format.
    
    Throws:
    
    IOException
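
    The contract between getCurrentState() and reopen(...) can be pictured with a small in-memory reader. This is a sketch under stated assumptions: `CheckpointableReaderSketch` is a hypothetical, JDK-only class, a `long[]` of length two stands in for `Tuple2<Long,Long>`, and reading the two values as (block index, records already read in that block) is an assumption made for the example rather than something this page confirms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, JDK-only stand-in; not the Flink implementation.
final class CheckpointableReaderSketch {
    private final List<List<String>> blocks; // records grouped into blocks
    private int block;                       // index of the block being read
    private int recordInBlock;               // records already read in that block

    CheckpointableReaderSketch(List<List<String>> blocks) {
        this.blocks = blocks;
    }

    // Snapshot of the read position, standing in for Tuple2<Long,Long>.
    long[] getCurrentState() {
        return new long[] {block, recordInBlock};
    }

    // Restores a previously captured position.
    void reopen(long[] state) {
        block = (int) state[0];
        recordInBlock = (int) state[1];
    }

    boolean reachedEnd() {
        skipExhaustedBlocks();
        return block >= blocks.size();
    }

    // Callers are expected to check reachedEnd() first.
    String nextRecord() {
        skipExhaustedBlocks();
        return blocks.get(block).get(recordInBlock++);
    }

    // Moves to the next block once the current one is fully read.
    private void skipExhaustedBlocks() {
        while (block < blocks.size() && recordInBlock >= blocks.get(block).size()) {
            block++;
            recordInBlock = 0;
        }
    }

    public static void main(String[] args) {
        List<List<String>> data = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c", "d"));
        CheckpointableReaderSketch reader = new CheckpointableReaderSketch(data);
        reader.nextRecord();                     // reads "a"
        reader.nextRecord();                     // reads "b"
        reader.nextRecord();                     // reads "c"
        long[] state = reader.getCurrentState(); // position after "c"

        // Simulate recovery: a fresh reader resumes from the saved state.
        CheckpointableReaderSketch restored = new CheckpointableReaderSketch(data);
        restored.reopen(state);
        List<String> rest = new ArrayList<>();
        while (!restored.reachedEnd()) {
            rest.add(restored.nextRecord());
        }
        System.out.println(rest); // prints [d]
    }
}
```

    As the NOTE above says, the caller is responsible for pairing the state with the split it was taken from; the sketch models this by handing the same data to the restored reader.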
  - getFieldNames
```
protected String[] getFieldNames()
```
    Gets the field names of the read result.
    
    Returns:
    
    field names array
  - getFieldTypes
```
protected TypeInformation[] getFieldTypes()
```
    Gets the field types of the read result.
    
    Returns:
    
    field types array
  - getPredicate
```
@VisibleForTesting
protected org.apache.parquet.filter2.predicate.FilterPredicate getPredicate()
```
  - close
```
public void close()
           throws IOException
```
    Description copied from class: FileInputFormat
    
    Closes the file input stream of the input format.
    
    Specified by:
    
    close in interface InputFormat<E,FileInputSplit>
    
    Overrides:
    
    close in class FileInputFormat<E>
    
    Throws:
    
    IOException - Thrown, if the input could not be closed properly.
  - reachedEnd
```
public boolean reachedEnd()
                   throws IOException
```
    Description copied from interface: InputFormat
    
    Method used to check if the end of the input is reached.
    When this method is called, the input format is guaranteed to be opened.
    
    Specified by:
    
    reachedEnd in interface InputFormat<E,FileInputSplit>
    
    Returns:
    
    True if the end is reached, otherwise false.
    
    Throws:
    
    IOException - Thrown, if an I/O error occurred.
  - nextRecord
```
public E nextRecord(E e)
             throws IOException
```
    Description copied from interface: InputFormat
    
    Reads the next record from the input.
    When this method is called, the input format is guaranteed to be opened.
    
    Specified by:
    
    nextRecord in interface InputFormat<E,FileInputSplit>
    
    Parameters:
    
    e - Object that may be reused.
    
    Returns:
    
    Read record.
    
    Throws:
    
    IOException - Thrown, if an I/O error occurred.
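
    The way reachedEnd() and nextRecord(E) are typically driven together, including the reuse object, can be sketched as follows. `ReadLoopSketch`, `MiniFormat`, and `LineFormat` are hypothetical, JDK-only stand-ins for illustration; they only imitate the shape of the two methods and are not Flink classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical, JDK-only stand-in; not the Flink implementation.
final class ReadLoopSketch {

    // Minimal stand-in for the two read methods of an input format.
    interface MiniFormat<T> {
        boolean reachedEnd();
        T nextRecord(T reuse);
    }

    // In-memory format that fills the caller-provided StringBuilder.
    static final class LineFormat implements MiniFormat<StringBuilder> {
        private final Iterator<String> lines;

        LineFormat(List<String> lines) {
            this.lines = lines.iterator();
        }

        @Override
        public boolean reachedEnd() {
            return !lines.hasNext();
        }

        @Override
        public StringBuilder nextRecord(StringBuilder reuse) {
            reuse.setLength(0);         // clear the reused object
            reuse.append(lines.next()); // overwrite it with the next record
            return reuse;
        }
    }

    // Drains a format, copying each record because the reuse object is mutated.
    static List<String> readAll(MiniFormat<StringBuilder> format) {
        List<String> out = new ArrayList<>();
        StringBuilder reuse = new StringBuilder();
        while (!format.reachedEnd()) {
            StringBuilder record = format.nextRecord(reuse);
            out.add(record.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(readAll(new LineFormat(Arrays.asList("x", "y", "z"))));
        // prints [x, y, z]
    }
}
```

    Because the same object may be handed back on every call, a caller that keeps records beyond one iteration should copy them, as `readAll` does here.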
  - convert
```
protected abstract E convert(Row row)
```
    This ParquetInputFormat reads Parquet records as Row by default. Subclasses implement this method to further convert the row to other types, such as POJO, Map, or Tuple.
    
    Parameters:
    
    row - row read from parquet file
    
    Returns:
    
    The converted record of the target result type E.
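
    The division of labor described here, where the base class reads rows and a subclass decides the final record type, can be sketched with plain JDK types. `ConvertSketch`, `RowReader`, and `MapReader` are hypothetical names for this example, and `Object[]` stands in for Flink's Row; the sketch does not reproduce the actual subclasses listed on this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, JDK-only stand-in; not the Flink implementation.
final class ConvertSketch {

    // Base class: receives raw rows and delegates the final conversion.
    abstract static class RowReader<E> {
        protected final String[] fieldNames;

        RowReader(String[] fieldNames) {
            this.fieldNames = fieldNames;
        }

        // Object[] stands in for Flink's Row type in this sketch.
        protected abstract E convert(Object[] row);

        final E read(Object[] rawRow) {
            return convert(rawRow);
        }
    }

    // Subclass: converts each row to a field-name -> value map.
    static final class MapReader extends RowReader<Map<String, Object>> {
        MapReader(String[] fieldNames) {
            super(fieldNames);
        }

        @Override
        protected Map<String, Object> convert(Object[] row) {
            Map<String, Object> result = new LinkedHashMap<>();
            for (int i = 0; i < fieldNames.length; i++) {
                result.put(fieldNames[i], row[i]);
            }
            return result;
        }
    }

    public static void main(String[] args) {
        MapReader reader = new MapReader(new String[] {"id", "name"});
        System.out.println(reader.read(new Object[] {1, "flink"}));
        // prints {id=1, name=flink}
    }
}
```

    Keeping the conversion in one abstract method means the reading logic is written once, and each target type only supplies its own mapping from row to record.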
