This documentation is for an out-of-date version of Apache Flink. We recommend you use the latest stable version.
Using the HiveCatalog and Flink’s connector to Hive, Flink can read and write from Hive data as an alternative to Hive’s batch engine.
Be sure to follow the instructions to include the correct dependencies in your application.
And please also note that Hive connector only works with blink planner.
You have to use the Hive catalog as your current catalog before you can query views in that catalog. It can be done by either tableEnv.useCatalog(...) in Table API or USE CATALOG ... in SQL Client.
Hive and Flink SQL have different syntax, e.g. different reserved keywords and literals. Make sure the view’s query is compatible with Flink grammar.
Writing To Hive
Similarly, data can be written into hive using an INSERT clause.
Consider there is an example table named “mytable” with two columns: name and age, in string and int type.
We support partitioned table too, Consider there is a partitioned table named myparttable with four columns: name, age, my_type and my_date, in types …… my_type and my_date are the partition keys.
Formats
We have tested on the following of table storage formats: text, csv, SequenceFile, ORC, and Parquet.
Optimizations
Partition Pruning
Flink uses partition pruning as a performance optimization to limits the number of files and partitions
that Flink reads when querying Hive tables. When your data is partitioned, Flink only reads a subset of the partitions in
a Hive table when a query matches certain filter criteria.
Projection Pushdown
Flink leverages projection pushdown to minimize data transfer between Flink and Hive tables by omitting
unnecessary fields from table scans.
It is especially beneficial when a table contains many columns.
Limit Pushdown
For queries with LIMIT clause, Flink will limit the number of output records wherever possible to minimize the
amount of data transferred across network.
ORC Vectorized Optimization upon Read
Optimization is used automatically when the following conditions are met:
Columns without complex data type, like hive types: List, Map, Struct, Union.
Hive version greater than or equal to version 2.0.0.
This feature is turned on by default. If there is a problem, you can use this config option to close ORC Vectorized Optimization:
Roadmap
We are planning and actively working on supporting features like
ACID tables
bucketed tables
more formats
Please reach out to the community for more feature request https://flink.apache.org/community.html#mailing-lists