Parquet
This documentation is for an out-of-date version of Apache Flink. We recommend you use the latest stable version.

Parquet Format #

Format: Serialization Schema Format: Deserialization Schema

The Apache Parquet format allows to read and write Parquet data.

Dependencies #

In order to use the Parquet format the following dependencies are required for both projects using a build automation tool (such as Maven or SBT) and SQL Client with SQL JAR bundles.

Maven dependency SQL Client
Download

How to create a table with Parquet format #

Here is an example to create a table using Filesystem connector and Parquet format.

CREATE TABLE user_behavior (
  user_id BIGINT,
  item_id BIGINT,
  category_id BIGINT,
  behavior STRING,
  ts TIMESTAMP(3),
  dt STRING
) PARTITIONED BY (dt) WITH (
 'connector' = 'filesystem',
 'path' = '/tmp/user_behavior',
 'format' = 'parquet'
)

Format Options #

Option Required Default Type Description
format
required (none) String Specify what format to use, here should be 'parquet'.
parquet.utc-timezone
optional false Boolean Use UTC timezone or local timezone to the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use local timezone. But Hive 3.x use UTC timezone.

Parquet format also supports configuration from ParquetOutputFormat. For example, you can configure parquet.compression=GZIP to enable gzip compression.

Data Type Mapping #

Currently, Parquet format type mapping is compatible with Apache Hive, but different with Apache Spark:

  • Timestamp: mapping timestamp type to int96 whatever the precision is.
  • Decimal: mapping decimal type to fixed length byte array according to the precision.

The following table lists the type mapping from Flink type to Parquet type.

Flink Data Type Parquet type Parquet logical type
CHAR / VARCHAR / STRING BINARY UTF8
BOOLEAN BOOLEAN
BINARY / VARBINARY BINARY
DECIMAL FIXED_LEN_BYTE_ARRAY DECIMAL
TINYINT INT32 INT_8
SMALLINT INT32 INT_16
INT INT32
BIGINT INT64
FLOAT FLOAT
DOUBLE DOUBLE
DATE INT32 DATE
TIME INT32 TIME_MILLIS
TIMESTAMP INT96
ARRAY LIST
MAP MAP
ROW STRUCT
Composite data type: Array, Map and Row are currently only supported when writing, not reading.