This documentation is for an unreleased version of Apache Flink. We recommend you use the latest stable version.
Parquet Format #
Format: Serialization Schema Format: Deserialization Schema
The Apache Parquet format allows reading and writing Parquet data.
Dependencies #
In order to use the Parquet format, the following dependencies are required both for projects using a build automation tool (such as Maven or SBT) and for the SQL Client with SQL JAR bundles.
The SQL Client JAR bundle is only available for stable releases.
How to create a table with Parquet format #
Here is an example of creating a table using the Filesystem connector and the Parquet format.
```sql
CREATE TABLE user_behavior (
  user_id BIGINT,
  item_id BIGINT,
  category_id BIGINT,
  behavior STRING,
  ts TIMESTAMP(3),
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/user_behavior',
  'format' = 'parquet'
);
```
Format Options #
| Option | Required | Default | Type | Description |
|---|---|---|---|---|
| format | required | (none) | String | Specify what format to use; here it should be 'parquet'. |
| parquet.utc-timezone | optional | false | Boolean | Use the UTC timezone or the local timezone for the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use the local timezone, but Hive 3.x uses the UTC timezone. |
| timestamp.time.unit | optional | micros | String | Store parquet int64/LogicalTypes timestamps in this time unit; valid values are nanos, micros, and millis. |
| write.int64.timestamp | optional | false | Boolean | Write parquet timestamps as int64/LogicalTypes instead of int96/OriginalTypes. Note: the timestamp will be time zone agnostic (NEVER converted to a different time zone). |
The Parquet format also supports configuration from ParquetOutputFormat. For example, you can configure parquet.compression=GZIP to enable gzip compression.
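As a sketch of the passthrough described above, ParquetOutputFormat options can be supplied alongside the regular format options in the table's WITH clause. The table name and path below are illustrative, not part of the official documentation:

```sql
-- Hypothetical table demonstrating a ParquetOutputFormat passthrough option.
CREATE TABLE compressed_user_behavior (
  user_id BIGINT,
  behavior STRING,
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/compressed_user_behavior',
  'format' = 'parquet',
  -- passed through to ParquetOutputFormat to enable gzip compression
  'parquet.compression' = 'GZIP'
);
```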
Data Type Mapping #
Currently, the Parquet format type mapping is compatible with Apache Hive, but by default not with Apache Spark:

- Timestamp: the timestamp type is mapped to int96 regardless of the precision. Spark compatibility requires int64 via the config option write.int64.timestamp (see above).
- Decimal: the decimal type is mapped to a fixed-length byte array according to the precision.
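To make the timestamp note above concrete, a Spark-compatible table definition might look like the following sketch, using the option keys from the Format Options table (the table name and path are illustrative):

```sql
-- Hypothetical table writing timestamps as int64 so Spark can read them.
CREATE TABLE spark_compatible_events (
  id BIGINT,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/spark_compatible_events',
  'format' = 'parquet',
  -- write int64/LogicalTypes timestamps instead of the default int96
  'write.int64.timestamp' = 'true',
  -- micros is the default time unit; stated here for clarity
  'timestamp.time.unit' = 'micros'
);
```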
The following table lists the type mapping from Flink type to Parquet type.
| Flink Data Type | Parquet type | Parquet logical type | Limitations |
|---|---|---|---|
| CHAR / VARCHAR / STRING | BINARY | UTF8 | |
| BOOLEAN | BOOLEAN | | |
| BINARY / VARBINARY | BINARY | | |
| DECIMAL | FIXED_LEN_BYTE_ARRAY | DECIMAL | |
| TINYINT | INT32 | INT_8 | |
| SMALLINT | INT32 | INT_16 | |
| INT | INT32 | | |
| BIGINT | INT64 | | |
| FLOAT | FLOAT | | |
| DOUBLE | DOUBLE | | |
| DATE | INT32 | DATE | |
| TIME | INT32 | TIME_MILLIS | |
| TIMESTAMP | INT96 (or INT64) | | |
| ARRAY | - | LIST | |
| MAP | - | MAP | [Parquet does not support nullable map keys](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps) |
| MULTISET | - | MAP | [Parquet does not support nullable map keys](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps) |
| ROW | - | STRUCT | |