Format: Serialization Schema / Deserialization Schema
The Apache Parquet format allows reading and writing Parquet data.
In order to set up the Parquet format, the following table provides dependency information both for projects using a build automation tool (such as Maven or SBT) and for the SQL Client with SQL JAR bundles.
| Maven dependency | SQL Client JAR |
| --- | --- |
| flink-parquet_2.11 | Download |
Here is an example of creating a table using the Filesystem connector and the Parquet format.
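A minimal sketch follows; the table name, schema, and path are illustrative, not prescribed by the format:

```sql
CREATE TABLE user_behavior (
  user_id BIGINT,
  item_id BIGINT,
  category_id BIGINT,
  behavior STRING,
  ts TIMESTAMP(3),
  dt STRING
) PARTITIONED BY (dt) WITH (
  'connector' = 'filesystem',      -- Filesystem connector
  'path' = '/tmp/user_behavior',   -- illustrative output path
  'format' = 'parquet'             -- use the Parquet format
)
```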
| Option | Required | Default | Type | Description |
| --- | --- | --- | --- | --- |
| format | required | (none) | String | Specify what format to use; here it should be 'parquet'. |
| parquet.utc-timezone | optional | false | Boolean | Use UTC timezone or local timezone for the conversion between epoch time and LocalDateTime. Hive 0.x/1.x/2.x use the local timezone, but Hive 3.x uses the UTC timezone. |
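As a sketch (the table name, column, and path are invented), the option is set next to the format in the table's WITH clause:

```sql
CREATE TABLE hive3_compatible (
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/hive3_compatible',  -- illustrative path
  'format' = 'parquet',
  'parquet.utc-timezone' = 'true'    -- use UTC, matching Hive 3.x behavior
)
```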
The Parquet format also supports the configuration options of ParquetOutputFormat. For example, you can configure parquet.compression=GZIP to enable gzip compression.
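For instance (the table name, columns, and path are again invented), gzip compression could be enabled like this:

```sql
CREATE TABLE gzip_output (
  id BIGINT,
  msg STRING
) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/gzip_output',     -- illustrative path
  'format' = 'parquet',
  'parquet.compression' = 'GZIP'   -- forwarded to ParquetOutputFormat
)
```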
Currently, the Parquet format type mapping is compatible with Apache Hive, but different from Apache Spark:

- Timestamp: the timestamp type is mapped to INT96 regardless of precision.
- Decimal: the decimal type is mapped to a fixed-length byte array according to the precision.

The following table lists the type mapping from Flink data types to Parquet types.
| Flink Data Type | Parquet type | Parquet logical type |
| --- | --- | --- |
| CHAR / VARCHAR / STRING | BINARY | UTF8 |
| BOOLEAN | BOOLEAN | |
| BINARY / VARBINARY | BINARY | |
| DECIMAL | FIXED_LEN_BYTE_ARRAY | DECIMAL |
| TINYINT | INT32 | INT_8 |
| SMALLINT | INT32 | INT_16 |
| INT | INT32 | |
| BIGINT | INT64 | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| DATE | INT32 | DATE |
| TIME | INT32 | TIME_MILLIS |
| TIMESTAMP | INT96 | |
Attention: composite data types (Array, Map, and Row) are not supported.
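To make the mapping concrete, here is a hypothetical DDL (table name, columns, and path are invented) that declares only types with a defined Parquet mapping above; an ARRAY, MAP, or ROW column would not be supported:

```sql
CREATE TABLE mapped_types (
  name       STRING,         -- BINARY (UTF8)
  amount     DECIMAL(10, 2), -- FIXED_LEN_BYTE_ARRAY (DECIMAL)
  quantity   INT,            -- INT32
  event_date DATE,           -- INT32 (DATE)
  event_ts   TIMESTAMP(3)    -- INT96
) WITH (
  'connector' = 'filesystem',
  'path' = '/tmp/mapped_types',  -- illustrative path
  'format' = 'parquet'
)
```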