This documentation is for an unreleased version of Apache Flink Table Store. We recommend you use the latest stable version.
Overview #
Flink Table Store is a unified storage to build dynamic tables for both streaming and batch processing in Flink, supporting high-speed data ingestion and timely data query.
Architecture #
As shown in the architecture above:
Read/Write: Table Store supports versatile ways to read and write data and to perform OLAP queries.
- For reads, it supports consuming data
  - from historical snapshots (in batch mode),
  - from the latest offset (in streaming mode), or
  - from incremental snapshots in a hybrid way.
- For writes, it supports streaming synchronization from the changelog of databases (CDC) or batch insert/overwrite from offline data.
Ecosystem: In addition to Apache Flink, Table Store can also be read by other computation engines such as Apache Hive, Apache Spark, and Trino.
Internal: Under the hood, Table Store uses a hybrid storage architecture with a lake format to store historical data and a queue system to store incremental data. The former stores columnar files on a filesystem or object store and uses an LSM tree structure to support a large volume of data updates and high-performance queries. The latter uses Apache Kafka to capture data in real time.
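To make the LSM tree idea concrete, here is a minimal sketch in Python. It is a toy model, not Table Store's implementation: the class `ToyLsmTable`, its methods, and the `memtable_limit` parameter are all hypothetical names chosen for illustration. It shows only the general technique: updates are buffered in memory, flushed as immutable sorted runs, and merged at read time so that the newest version of each key wins.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Run = List[Tuple[str, Optional[int]]]


class ToyLsmTable:
    """Hypothetical toy LSM structure; not part of any Table Store API."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable: Dict[str, Optional[int]] = {}  # mutable in-memory buffer
        self.runs: List[Run] = []  # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: Optional[int]) -> None:
        """Upsert a key; a value of None marks a delete (tombstone)."""
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Write the buffer out as one sorted, immutable run."""
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def scan(self) -> List[Tuple[str, int]]:
        """Merge all runs by key; the newest version of each key wins."""
        sources = self.runs + [sorted(self.memtable.items())]
        # Tag entries with the negated run age so that, for equal keys,
        # the entry from the newest run is produced first by the merge.
        tagged = [
            [(key, -age, value) for key, value in run]
            for age, run in enumerate(sources)
        ]
        result: List[Tuple[str, int]] = []
        last_key = None
        for key, _, value in heapq.merge(*tagged):
            if key == last_key:
                continue  # an older version of a key already handled
            last_key = key
            if value is not None:  # skip tombstones
                result.append((key, value))
        return result


table = ToyLsmTable(memtable_limit=2)
table.put("a", 1)
table.put("b", 2)      # buffer is full, flushed as the first run
table.put("a", 10)     # update to an existing key
table.put("c", 3)      # flushed as the second run
table.put("b", None)   # delete, still in the in-memory buffer
print(table.scan())
```

Writes in this sketch never modify an existing run, which is why this kind of structure handles a high volume of updates well: the cost of reconciling versions is deferred to the merge at read time (and, in real systems, to background compaction, which is omitted here).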
Unified Storage #
There are three types of connectors in Flink SQL.
- Message queue, such as Apache Kafka: used in both the source and intermediate stages of the pipeline to keep latency within seconds.
- OLAP system, such as ClickHouse: receives processed data in a streaming fashion and serves users' ad-hoc queries.
- Batch storage, such as Apache Hive: supports the various operations of traditional batch processing, including INSERT OVERWRITE.
Flink Table Store provides a table abstraction. It is used in the same way as a table in a traditional database:
- In Flink batch execution mode, it acts like a Hive table and supports the various operations of Batch SQL. Query it to see the latest snapshot.
- In Flink streaming execution mode, it acts like a message queue. Querying it is like querying a stream changelog from a message queue where historical data never expires.
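The two views of one table can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not Table Store's API: `ToyDynamicTable`, `batch_query`, and `streaming_query` are hypothetical names, the change kinds are simplified to `+I` (insert), `+U` (update), and `-D` (delete), and a real streaming query is unbounded rather than returning a finite list.

```python
from typing import Dict, List, Tuple

# (kind, key, value); kind is "+I", "+U", or "-D" in this simplified model
Change = Tuple[str, str, int]


class ToyDynamicTable:
    """Hypothetical toy table that keeps its full changelog."""

    def __init__(self) -> None:
        self.changelog: List[Change] = []  # never expires in this toy model

    def write(self, kind: str, key: str, value: int) -> None:
        self.changelog.append((kind, key, value))

    def batch_query(self) -> Dict[str, int]:
        """Batch view: materialize the latest snapshot from all changes."""
        snapshot: Dict[str, int] = {}
        for kind, key, value in self.changelog:
            if kind == "-D":
                snapshot.pop(key, None)
            else:
                snapshot[key] = value
        return snapshot

    def streaming_query(self, from_offset: int = 0) -> List[Change]:
        """Streaming view: the change records starting at an offset."""
        return self.changelog[from_offset:]


orders = ToyDynamicTable()
orders.write("+I", "a", 1)
orders.write("+I", "b", 2)
orders.write("+U", "a", 5)
orders.write("-D", "b", 2)

print(orders.batch_query())       # latest state of the table
print(orders.streaming_query())   # full history of changes
print(orders.streaming_query(2))  # only changes from offset 2 onward
```

The same stored changes back both views: the batch view collapses them into current rows, while the streaming view exposes each change in order, which is what lets one table stand in for both a batch store and a message queue.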