Parquet
This documentation is for an out-of-date version of Apache Flink. We recommend you use the latest stable version.

Parquet formats

Flink has extensive built-in support for Apache Parquet, which makes it easy to read Parquet files with Flink. Be sure to add the Flink Parquet dependency to the pom.xml of your project.

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-parquet_2.11</artifactId>
  <version>1.13.3</version>
</dependency>

In order to read data from a Parquet file, you have to specify one of the implementations of ParquetInputFormat. Several are available, depending on your needs:

  • ParquetPojoInputFormat&lt;E&gt; to read POJOs from Parquet files
  • ParquetRowInputFormat to read Flink Rows (column-oriented records) from Parquet files
  • ParquetMapInputFormat to read Map records (maps of nested Flink type objects) from Parquet files
  • ParquetAvroInputFormat to read Avro GenericRecords from Parquet files

Example for ParquetRowInputFormat:

MessageType parquetSchema = ...; // use the Parquet libraries to parse a schema definition or extract the schema from an existing Parquet file
ParquetRowInputFormat parquetInputFormat = new ParquetRowInputFormat(new Path(filePath), parquetSchema);
// Optionally project only the needed fields to reduce the amount of data read:
// parquetInputFormat.selectFields(projectedFieldNames);
DataStream<Row> input = env.createInput(parquetInputFormat);
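One way to obtain the MessageType is to parse a schema definition string with Parquet's MessageTypeParser. A minimal sketch, assuming a hypothetical two-field schema (substitute the fields of your own files):

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class SchemaExample {
    public static void main(String[] args) {
        // Hypothetical schema for illustration; adapt field names and types to your data.
        String schemaString =
            "message Person {\n" +
            "  required binary name (UTF8);\n" +
            "  required int32 age;\n" +
            "}";
        MessageType parquetSchema = MessageTypeParser.parseMessageType(schemaString);
        System.out.println(parquetSchema.getFieldCount());
    }
}
```

Alternatively, the schema can be read from the footer of an existing Parquet file via the Parquet file reader APIs.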

Example for ParquetAvroInputFormat:

MessageType parquetSchema = ...; // use the Parquet libraries to parse a schema definition or extract the schema from an existing Parquet file
ParquetAvroInputFormat parquetInputFormat = new ParquetAvroInputFormat(new Path(filePath), parquetSchema);
// Optionally project only the needed fields to reduce the amount of data read:
// parquetInputFormat.selectFields(projectedFieldNames);
DataStream<GenericRecord> input = env.createInput(parquetInputFormat);
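Downstream operators can then access fields of each GenericRecord by name. A hedged sketch, assuming the records contain a "name" field (an illustrative name, not part of the example above); the explicit returns() type hint is needed because Avro's Object-typed getters defeat Flink's type extraction for lambdas:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;

// "name" is an assumed field; substitute a field from your own schema.
DataStream<String> names = input
    .map(record -> record.get("name").toString())
    .returns(Types.STRING);
```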