Parquet

Parquet Format #

Format: Serialization Schema Format: Deserialization Schema

The Apache Parquet format allows to read and write Parquet data.

Dependencies #

In order to use the Parquet format the following dependencies are required for both projects using a build automation tool (such as Maven or SBT) and SQL Client with SQL JAR bundles.

Maven dependency SQL Client
Download

How to create a table with Parquet format #

Here is an example to create a table using Filesystem connector and Parquet format.

CREATE TABLE user_behavior (
  user_id BIGINT,
  item_id BIGINT,
  category_id BIGINT,
  behavior STRING,
  ts TIMESTAMP(3),
  dt STRING
) PARTITIONED BY (dt) WITH (
 'connector' = 'filesystem',
 'path' = '/tmp/user_behavior',
 'format' = 'parquet'
)

Format Options #

Option Required Default Type Description
format
required (none) String Specify what format to use, here should be 'parquet'.
parquet.utc-timezone
optional false Boolean Use UTC timezone or local timezone to the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use local timezone. But Hive 3.x use UTC timezone.
timestamp.time.unit
optional micros String Store parquet int64/LogicalTypes timestamps in this time unit, value is nanos/micros/millis.
write.int64.timestamp
optional false Boolean Write parquet timestamp as int64/LogicalTypes instead of int96/OriginalTypes. Note: Timestamp will be time zone agnostic (NEVER converted to a different time zone).

Parquet format also supports configuration from ParquetOutputFormat. For example, you can configure parquet.compression=GZIP to enable gzip compression.

Data Type Mapping #

Currently, Parquet format type mapping is compatible with Apache Hive, but by default not with Apache Spark:

  • Timestamp: mapping timestamp type to int96 whatever the precision is.
  • Spark compatibility requires int64 via config option write.int64.timestamp (see above).
  • Decimal: mapping decimal type to fixed length byte array according to the precision.

The following table lists the type mapping from Flink type to Parquet type.

Flink Data Type Parquet type Parquet logical type
CHAR / VARCHAR / STRING BINARY UTF8
BOOLEAN BOOLEAN
BINARY / VARBINARY BINARY
DECIMAL FIXED_LEN_BYTE_ARRAY DECIMAL
TINYINT INT32 INT_8
SMALLINT INT32 INT_16
INT INT32
BIGINT INT64
FLOAT FLOAT
DOUBLE DOUBLE
DATE INT32 DATE
TIME INT32 TIME_MILLIS
TIMESTAMP INT96 (or INT64)
ARRAY - LIST
MAP - MAP
ROW - STRUCT