This documentation is for an unreleased version of Apache Flink Table Store. We recommend you use the latest stable version.

Overview #

Flink Table Store is a unified storage to build dynamic tables for both streaming and batch processing in Flink, supporting high-speed data ingestion and timely data query.

Architecture #

As shown in the architecture above:

Read/Write: Table Store supports a versatile way to read/write data and perform OLAP queries.

For reads, it supports consuming data
- from historical snapshots (in batch mode),
- from the latest offset (in streaming mode), or
- reading incremental snapshots in a hybrid way.
For writes, it supports streaming synchronization from the changelog of databases (CDC) or batch insert/overwrite from offline data.

Ecosystem: In addition to Apache Flink, Table Store also supports read by other computation engines like Apache Hive, Apache Spark and Trino.

Internal: Under the hood, Table Store uses a hybrid storage architecture with a lake format to store historical data and a queue system to store incremental data. The former stores the columnar files on the filesystem/object-store and uses the LSM tree structure to support a large volume of data updates and high-performance queries. The latter uses Apache Kafka to capture data in real-time.

Unified Storage #

There are three types of connectors in Flink SQL.

Message queue, such as Apache Kafka, it is used in both source and intermediate stages in this pipeline, to guarantee the latency stay within seconds.
OLAP system, such as Clickhouse, it receives processed data in streaming fashion and serving user’s ad-hoc queries.
Batch storage, such as Apache Hive, it supports various operations of the traditional batch processing, including INSERT OVERWRITE.

Flink Table Store provides table abstraction. It is used in a way that does not differ from the traditional database:

In Flink batch execution mode, it acts like a Hive table and supports various operations of Batch SQL. Query it to see the latest snapshot.
In Flink streaming execution mode, it acts like a message queue. Query it acts like querying a stream changelog from a message queue where historical data never expires.