This documentation is for an unreleased version of Apache Flink. We recommend you use the latest stable version.
Parquet Format #
Format: Serialization Schema Format: Deserialization Schema
The Apache Parquet format allows reading and writing Parquet data.
Dependencies #
In order to use the Parquet format, the following dependencies are required both for projects using a build automation tool (such as Maven or SBT) and for the SQL Client with SQL JAR bundles.
The SQL Client JAR bundle is only available for stable releases.
How to create a table with Parquet format #
Here is an example of creating a table using the Filesystem connector and the Parquet format.
```sql
CREATE TABLE user_behavior (
  user_id BIGINT,
  item_id BIGINT,
  category_id BIGINT,
  behavior STRING,
  ts TIMESTAMP(3),
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/user_behavior',
  'format' = 'parquet'
);
```
Format Options #
| Option | Required | Default | Type | Description |
|---|---|---|---|---|
| format | required | (none) | String | Specify what format to use; here it should be 'parquet'. |
| parquet.utc-timezone | optional | false | Boolean | Use the UTC timezone or the local timezone for the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use the local timezone, but Hive 3.x uses the UTC timezone. |
| timestamp.time.unit | optional | micros | String | Store parquet int64/LogicalTypes timestamps in this time unit; valid values are nanos, micros, and millis. |
| write.int64.timestamp | optional | false | Boolean | Write parquet timestamps as int64/LogicalTypes instead of int96/OriginalTypes. Note: the timestamp will be time zone agnostic (NEVER converted to a different time zone). |
The Parquet format also supports configuration from ParquetOutputFormat. For example, you can configure parquet.compression=GZIP to enable gzip compression.
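As a sketch of the passthrough described above, ParquetOutputFormat options can be supplied alongside the regular format options in the table's WITH clause. The table name and path below are illustrative, not part of the official documentation:

```sql
-- Hypothetical table demonstrating a ParquetOutputFormat passthrough option.
CREATE TABLE compressed_user_behavior (
  user_id BIGINT,
  behavior STRING,
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/compressed_user_behavior',
  'format' = 'parquet',
  -- passed through to ParquetOutputFormat to enable gzip compression
  'parquet.compression' = 'GZIP'
);
```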
Data Type Mapping #
Currently, the Parquet format type mapping is compatible with Apache Hive, but by default not with Apache Spark:

- Timestamp: the timestamp type is mapped to int96 regardless of the precision. Spark compatibility requires int64 via the config option write.int64.timestamp (see above).
- Decimal: the decimal type is mapped to a fixed-length byte array according to the precision.
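To make the timestamp note above concrete, a Spark-compatible table definition might look like the following sketch, using the option keys from the Format Options table (the table name and path are illustrative):

```sql
-- Hypothetical table writing timestamps as int64 so Spark can read them.
CREATE TABLE spark_compatible_events (
  id BIGINT,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/spark_compatible_events',
  'format' = 'parquet',
  -- write int64/LogicalTypes timestamps instead of the default int96
  'write.int64.timestamp' = 'true',
  -- micros is the default time unit; stated here for clarity
  'timestamp.time.unit' = 'micros'
);
```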
The following table lists the type mapping from Flink type to Parquet type.
| Flink Data Type | Parquet type | Parquet logical type | Limitations |
|---|---|---|---|
| CHAR / VARCHAR / STRING | BINARY | UTF8 | |
| BOOLEAN | BOOLEAN | | |
| BINARY / VARBINARY | BINARY | | |
| DECIMAL | FIXED_LEN_BYTE_ARRAY | DECIMAL | |
| TINYINT | INT32 | INT_8 | |
| SMALLINT | INT32 | INT_16 | |
| INT | INT32 | | |
| BIGINT | INT64 | | |
| FLOAT | FLOAT | | |
| DOUBLE | DOUBLE | | |
| DATE | INT32 | DATE | |
| TIME | INT32 | TIME_MILLIS | |
| TIMESTAMP | INT96 (or INT64) | | |
| ARRAY | - | LIST | |
| MAP | - | MAP | [Parquet does not support nullable map keys](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps) |
| MULTISET | - | MAP | [Parquet does not support nullable map keys](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps) |
| ROW | - | STRUCT | |