This documentation is for an out-of-date version of Apache Flink. We recommend you use the latest stable version.
Parquet formats
Flink has extensive built-in support for Apache Parquet, which makes it easy to read from Parquet files with Flink. Be sure to include the Flink Parquet dependency in the pom.xml of your project.
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-parquet_2.11</artifactId>
<version>1.13.6</version>
</dependency>
In order to read data from a Parquet file, you have to specify one of the implementations of ParquetInputFormat. There are several, depending on your needs:

- ParquetPojoInputFormat<E> to read POJOs from Parquet files
- ParquetRowInputFormat to read Flink Rows (column-oriented records) from Parquet files
- ParquetMapInputFormat to read Map records (maps of nested Flink type objects) from Parquet files
- ParquetAvroInputFormat to read Avro GenericRecords from Parquet files
Example for ParquetRowInputFormat:

// use the Parquet libraries to parse a provided schema definition, or extract the schema from the Parquet files themselves
MessageType parquetSchema = ...;
ParquetRowInputFormat parquetInputFormat = new ParquetRowInputFormat(new Path(filePath), parquetSchema);
// if suited, project only the needed fields to reduce the amount of data read:
// parquetInputFormat.selectFields(projectedFieldNames);
DataStream<Row> input = env.createInput(parquetInputFormat);
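The MessageType in the example above must be obtained through the Parquet libraries; the input format does not parse a schema file itself. As a hedged sketch (the schema definition here is made up purely for illustration), a schema can be parsed from its textual representation with Parquet's MessageTypeParser:

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

// a hypothetical schema, for illustration only; in practice the schema must
// match the layout of the Parquet files you intend to read
MessageType parquetSchema = MessageTypeParser.parseMessageType(
    "message Person {\n"
    + "  required binary name (UTF8);\n"
    + "  required int32 age;\n"
    + "}");
```

Alternatively, the schema can be extracted from an existing file's footer, e.g. by reading the file metadata with Parquet's file reader utilities.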
Example for ParquetAvroInputFormat:

// use the Parquet libraries to parse a provided schema definition, or extract the schema from the Parquet files themselves
MessageType parquetSchema = ...;
ParquetAvroInputFormat parquetInputFormat = new ParquetAvroInputFormat(new Path(filePath), parquetSchema);
// if suited, project only the needed fields to reduce the amount of data read:
// parquetInputFormat.selectFields(projectedFieldNames);
DataStream<GenericRecord> input = env.createInput(parquetInputFormat);