Parquet
This documentation is for an out-of-date version of Apache Flink. We recommend you use the latest stable version.

Parquet formats

Flink has extensive built-in support for Apache Parquet, which makes it easy to read Parquet files with Flink. Be sure to add the Flink Parquet dependency to the pom.xml of your project.

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-parquet_2.11</artifactId>
  <version>1.13.3</version>
</dependency>

In order to read data from a Parquet file, you have to specify one of the implementations of ParquetInputFormat. Several are available, depending on your needs:

  • ParquetPojoInputFormat&lt;E&gt; to read POJOs from Parquet files
  • ParquetRowInputFormat to read Flink Rows (column-oriented records) from Parquet files
  • ParquetMapInputFormat to read Map records (maps of nested Flink type objects) from Parquet files
  • ParquetAvroInputFormat to read Avro GenericRecords from Parquet files

Example for ParquetRowInputFormat:

MessageType parquetSchema = ...; // use the Parquet libraries to parse a schema definition or extract the schema from an existing Parquet file
ParquetRowInputFormat parquetInputFormat = new ParquetRowInputFormat(new Path(filePath), parquetSchema);
// Optionally project only the needed fields to reduce the amount of data read:
// parquetInputFormat.selectFields(projectedFieldNames);
DataStream<Row> input = env.createInput(parquetInputFormat);
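One way to obtain the MessageType is to parse a schema definition string with Parquet's MessageTypeParser. A minimal sketch, assuming a hypothetical two-field schema (substitute the fields of your own files):

```java
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class SchemaExample {
    public static void main(String[] args) {
        // Hypothetical schema for illustration; adapt field names and types to your data.
        String schemaString =
            "message Person {\n" +
            "  required binary name (UTF8);\n" +
            "  required int32 age;\n" +
            "}";
        MessageType parquetSchema = MessageTypeParser.parseMessageType(schemaString);
        System.out.println(parquetSchema.getFieldCount());
    }
}
```

Alternatively, the schema can be read from the footer of an existing Parquet file via the Parquet file reader APIs.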

Example for ParquetAvroInputFormat:

MessageType parquetSchema = ...; // use the Parquet libraries to parse a schema definition or extract the schema from an existing Parquet file
ParquetAvroInputFormat parquetInputFormat = new ParquetAvroInputFormat(new Path(filePath), parquetSchema);
// Optionally project only the needed fields to reduce the amount of data read:
// parquetInputFormat.selectFields(projectedFieldNames);
DataStream<GenericRecord> input = env.createInput(parquetInputFormat);
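Downstream operators can then access fields of each GenericRecord by name. A hedged sketch, assuming the records contain a "name" field (an illustrative name, not part of the example above); the explicit returns() type hint is needed because Avro's Object-typed getters defeat Flink's type extraction for lambdas:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.DataStream;

// "name" is an assumed field; substitute a field from your own schema.
DataStream<String> names = input
    .map(record -> record.get("name").toString())
    .returns(Types.STRING);
```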