Apache Hive has established itself as a focal point of the data warehousing ecosystem.
It serves as not only a SQL engine for big data analytics and ETL, but also a data management platform, where data is discovered, defined, and evolved.
Flink offers a two-fold integration with Hive.
The first is to leverage Hive’s Metastore as a persistent catalog with Flink’s HiveCatalog for storing Flink specific metadata across sessions.
For example, users can store their Kafka or ElasticSearch tables in Hive Metastore by using HiveCatalog, and reuse them later on in SQL queries.
The second is to offer Flink as an alternative engine for reading and writing Hive tables.
The HiveCatalog is designed to be “out of the box” compatible with existing Hive installations.
You do not need to modify your existing Hive Metastore or change the data placement or partitioning of your tables.
Please note Hive itself have different features available for different versions, and these issues are not caused by Flink:
Hive built-in functions are supported in 1.2.0 and later.
Column constraints, i.e. PRIMARY KEY and NOT NULL, are supported in 3.1.0 and later.
Altering table statistics is supported in 1.2.0 and later.
DATE column statistics are supported in 1.2.0 and later.
Writing to ORC tables is not supported in 2.0.x.
To integrate with Hive, you need to add some extra dependencies to the /lib/ directory in Flink distribution
to make the integration work in Table API program or SQL in SQL Client.
Alternatively, you can put these dependencies in a dedicated folder, and add them to classpath with the -C
or -l option for Table API program or SQL Client respectively.
There are two ways to add Hive dependencies. First is to use Flink’s bundled Hive jars. You can choose a bundled Hive jar according to the version of the metastore you use. Second is to add each of the required jars separately. The second way can be useful if the Hive version you’re using is not listed here.
Using bundled hive jar
The following tables list all available bundled hive jars. You can pick one to the /lib/ directory in Flink distribution.
Please find the required dependencies for different Hive major versions below.
If you are building your own program, you need the following dependencies in your mvn file.
It’s recommended not to include these dependencies in the resulting jar file.
You’re supposed to add dependencies as stated above at runtime.
If the hive-conf/hive-site.xml file is stored in remote storage system, users should download
the hive configuration file to their local environment first.
Please note while HiveCatalog doesn’t require a particular planner, reading/writing Hive tables only works with blink planner.
Therefore it’s highly recommended that you use blink planner when connecting to your Hive warehouse.
HiveCatalog is capable of automatically detecting the Hive version in use. It’s recommended NOT to specify the Hive
version, unless the automatic detection fails.
Following is an example of how to connect to Hive:
Take Hive version 2.3.4 for example:
It’s recommended to use Hive dialect to execute DDLs to create
Hive tables, views, partitions, functions within Flink.