Db2 CDC Connector #
The Db2 CDC connector allows for reading snapshot data and incremental data from a Db2 database. This document describes how to set up the Db2 CDC connector to run SQL queries against Db2 databases.
Supported Databases #
Connector | Database | Driver |
---|---|---|
Db2-cdc | | Db2 Driver: 11.5.0.0 |
Dependencies #
To set up the Db2 CDC connector, use the dependency information below. It covers both projects that use a build automation tool (such as Maven or SBT) and the SQL Client with SQL JAR bundles.
Maven dependency #
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-db2-cdc</artifactId>
    <version>3.1.1</version>
</dependency>
SQL Client JAR #
Download flink-sql-connector-db2-cdc-3.1.0.jar and put it under <FLINK_HOME>/lib/.
Note: See flink-sql-connector-db2-cdc; further released versions will be available in the Maven Central repository.
Because the IPLA license of the Db2 driver is incompatible with the Flink CDC project, the driver cannot be shipped in the prebuilt connector JAR packages. You may need to configure the following dependencies manually.
Dependency Item | Description |
---|---|
com.ibm.db2.jcc:db2jcc:db2jcc4 | Used for connecting to Db2 database. |
Setup Db2 server #
Follow the steps in the Debezium Db2 Connector documentation.
Notes #
BOOLEAN type is not supported in SQL Replication on Db2 #
Only snapshots can be taken from tables with BOOLEAN columns. SQL Replication on Db2 currently does not support BOOLEAN, so Debezium cannot perform CDC on those tables. Consider replacing BOOLEAN with another type.
How to create a Db2 CDC table #
The Db2 CDC table can be defined as follows:
-- checkpoint every 3 seconds
Flink SQL> SET 'execution.checkpointing.interval' = '3s';

-- register a Db2 table 'products' in Flink SQL
Flink SQL> CREATE TABLE products (
    ID INT NOT NULL,
    NAME STRING,
    DESCRIPTION STRING,
    WEIGHT DECIMAL(10,3),
    PRIMARY KEY(ID) NOT ENFORCED
) WITH (
    'connector' = 'db2-cdc',
    'hostname' = 'localhost',
    'port' = '50000',
    'username' = 'root',
    'password' = '123456',
    'database-name' = 'mydb',
    'table-name' = 'myschema.products');

-- read snapshot and redo logs from products table
Flink SQL> SELECT * FROM products;
Connector Options #
Option | Required | Default | Type | Description |
---|---|---|---|---|
connector | required | (none) | String | Specifies which connector to use; here it should be 'db2-cdc'. |
hostname | required | (none) | String | IP address or hostname of the Db2 database server. |
username | required | (none) | String | Name of the Db2 user to use when connecting to the Db2 database server. |
password | required | (none) | String | Password to use when connecting to the Db2 database server. |
database-name | required | (none) | String | Database name of the Db2 server to monitor. |
table-name | required | (none) | String | Table name of the Db2 database to monitor, e.g.: "db1.table1" |
port | optional | 50000 | Integer | Integer port number of the Db2 database server. |
scan.startup.mode | optional | initial | String | Optional startup mode for the Db2 CDC consumer. Valid enumerations are "initial" and "latest-offset". See the Startup Reading Position section for more details. |
server-time-zone | optional | (none) | String | The session time zone of the database server, e.g. "Asia/Shanghai". It controls how the TIMESTAMP type in Db2 is converted to STRING. See more here. If not set, ZoneId.systemDefault() is used to determine the server time zone. |
scan.incremental.snapshot.enabled | optional | true | Boolean | Whether to enable the parallel (incremental) snapshot. |
chunk-meta.group.size | optional | 1000 | Integer | The group size of chunk meta, if the meta size exceeds the group size, the meta will be divided into multiple groups. |
chunk-key.even-distribution.factor.lower-bound | optional | 0.05d | Double | The lower bound of the chunk key distribution factor. The distribution factor is used to determine whether the table is evenly distributed. Table chunks are computed with the even-distribution optimization when the data is evenly distributed; otherwise a splitting query is issued. The distribution factor can be calculated as (MAX(id) - MIN(id) + 1) / rowCount. |
chunk-key.even-distribution.factor.upper-bound | optional | 1000.0d | Double | The upper bound of the chunk key distribution factor. The distribution factor is used to determine whether the table is evenly distributed. Table chunks are computed with the even-distribution optimization when the data is evenly distributed; otherwise a splitting query is issued. The distribution factor can be calculated as (MAX(id) - MIN(id) + 1) / rowCount. |
scan.incremental.snapshot.chunk.key-column | optional | (none) | String | The chunk key of the table snapshot. Captured tables are split into multiple chunks by the chunk key when reading the table snapshot. By default, the chunk key is the first column of the primary key. This column must be a column of the primary key. |
debezium.* | optional | (none) | String | Passes Debezium properties through to the Debezium Embedded Engine, which is used to capture data changes from the Db2 server. For example: 'debezium.snapshot.mode' = 'never'. See more about the Debezium's Db2 Connector properties |
scan.incremental.close-idle-reader.enabled | optional | false | Boolean | Whether to close idle readers at the end of the snapshot phase. This requires Flink 1.14 or later with 'execution.checkpointing.checkpoints-after-tasks-finish.enabled' set to true. Since Flink 1.15 that option defaults to true, so it does not need to be configured explicitly. |
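The distribution factor described for the two chunk-key.even-distribution.factor options can be illustrated with a small standalone sketch. It only applies the formula (MAX(id) - MIN(id) + 1) / rowCount and the default bounds from the table; the class and method names are hypothetical, the inclusive comparison against the bounds is an assumption, and the connector's own implementation may differ.

```java
/** Hypothetical sketch of the distribution-factor check; not the connector's actual code. */
final class ChunkDistributionSketch {

    // Default bounds copied from the Connector Options table.
    static final double LOWER_BOUND = 0.05d;
    static final double UPPER_BOUND = 1000.0d;

    /** Computes (MAX(id) - MIN(id) + 1) / rowCount for a numeric chunk key. */
    static double distributionFactor(long minId, long maxId, long rowCount) {
        if (rowCount <= 0) {
            throw new IllegalArgumentException("rowCount must be positive");
        }
        // Plain long arithmetic; assumes maxId - minId + 1 does not overflow.
        return (double) (maxId - minId + 1) / rowCount;
    }

    /** Assumption: the table counts as evenly distributed when the factor is within both bounds. */
    static boolean isEvenlyDistributed(double factor, double lowerBound, double upperBound) {
        return factor >= lowerBound && factor <= upperBound;
    }

    public static void main(String[] args) {
        // Dense keys 1..1000 over 1000 rows give a factor of 1.0.
        double dense = distributionFactor(1, 1000, 1000);
        // Sparse keys 1..10,000,000 over 100 rows give a factor of 100000.0.
        double sparse = distributionFactor(1, 10_000_000, 100);

        System.out.println("dense even? " + isEvenlyDistributed(dense, LOWER_BOUND, UPPER_BOUND));
        System.out.println("sparse even? " + isEvenlyDistributed(sparse, LOWER_BOUND, UPPER_BOUND));
    }
}
```

With the default bounds, the dense table falls inside the range and the sparse one does not, which is the case where the option descriptions say a splitting query is used instead of the even-distribution optimization.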
Available Metadata #
The following format metadata can be exposed as read-only (VIRTUAL) columns in a table definition.
Key | DataType | Description |
---|---|---|
table_name | STRING NOT NULL | Name of the table that contains the row. |
schema_name | STRING NOT NULL | Name of the schema that contains the row. |
database_name | STRING NOT NULL | Name of the database that contains the row. |
op_ts | TIMESTAMP_LTZ(3) NOT NULL | Time at which the change was made in the database. If the record is read from the snapshot of the table instead of the change stream, the value is always 0. |
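As a rough sketch of how the metadata keys above might be declared, the following Java snippet assembles a CREATE TABLE statement that adds four VIRTUAL metadata columns to the products table used earlier. The class name, the column aliases db_name and operation_ts, and the connection values are illustrative assumptions; the statement is only built and printed here, not executed.

```java
/** Hypothetical helper that builds a Db2 CDC table definition with metadata columns. */
final class Db2MetadataDdlSketch {

    /** Returns a Flink SQL DDL string; metadata keys come from the Available Metadata table. */
    static String createTableDdl() {
        return String.join(
                "\n",
                "CREATE TABLE products (",
                "    db_name STRING METADATA FROM 'database_name' VIRTUAL,",
                "    schema_name STRING METADATA FROM 'schema_name' VIRTUAL,",
                "    table_name STRING METADATA FROM 'table_name' VIRTUAL,",
                "    operation_ts TIMESTAMP_LTZ(3) METADATA FROM 'op_ts' VIRTUAL,",
                "    ID INT NOT NULL,",
                "    NAME STRING,",
                "    DESCRIPTION STRING,",
                "    WEIGHT DECIMAL(10,3),",
                "    PRIMARY KEY(ID) NOT ENFORCED",
                ") WITH (",
                "    'connector' = 'db2-cdc',",
                "    'hostname' = 'localhost',",
                "    'port' = '50000',",
                "    'username' = 'root',",
                "    'password' = '123456',",
                "    'database-name' = 'mydb',",
                "    'table-name' = 'myschema.products'",
                ")");
    }

    public static void main(String[] args) {
        // Print the statement so it can be pasted into the SQL Client.
        System.out.println(createTableDdl());
    }
}
```

In a Table API program the same string could be passed to TableEnvironment.executeSql; in the SQL Client the printed statement can be run directly.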
Features #
Startup Reading Position #
The config option scan.startup.mode specifies the startup mode for the Db2 CDC consumer. The valid enumerations are:
- initial (default): Performs an initial snapshot on the monitored database tables upon first startup, and continues to read the latest redo logs.
- latest-offset: Never performs a snapshot on the monitored database tables upon first startup; it reads only from the end of the redo logs, so only changes made since the connector was started are captured.
Note: the scan.startup.mode option relies on Debezium’s snapshot.mode configuration, so do not use the two together. If you specify both scan.startup.mode and debezium.snapshot.mode in the table DDL, scan.startup.mode may not take effect.
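One way to guard against the conflict described in the note is a small pre-flight check over the table options before the DDL is submitted. The sketch below is a hypothetical helper, not part of the connector: it simply reports whether both scan.startup.mode and debezium.snapshot.mode are present in a map of options.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical pre-flight check for conflicting startup options; not a connector API. */
final class StartupOptionCheckSketch {

    /** Returns true when both scan.startup.mode and debezium.snapshot.mode are set. */
    static boolean hasConflictingStartupOptions(Map<String, String> tableOptions) {
        return tableOptions.containsKey("scan.startup.mode")
                && tableOptions.containsKey("debezium.snapshot.mode");
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("connector", "db2-cdc");
        options.put("scan.startup.mode", "latest-offset");
        // Adding the Debezium pass-through option as well creates the conflict.
        options.put("debezium.snapshot.mode", "never");

        if (hasConflictingStartupOptions(options)) {
            System.out.println("Remove either scan.startup.mode or debezium.snapshot.mode.");
        }
    }
}
```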
DataStream Source #
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.cdc.connectors.db2.Db2Source;
import org.apache.flink.cdc.debezium.JsonDebeziumDeserializationSchema;

public class Db2SourceExample {
    public static void main(String[] args) throws Exception {
        SourceFunction<String> db2Source =
                Db2Source.<String>builder()
                        .hostname("yourHostname")
                        .port(50000)
                        .database("yourDatabaseName") // set captured database
                        .tableList("yourSchemaName.yourTableName") // set captured table
                        .username("yourUsername")
                        .password("yourPassword")
                        // converts SourceRecord to JSON String
                        .deserializer(new JsonDebeziumDeserializationSchema())
                        .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // enable checkpoint
        env.enableCheckpointing(3000);

        env.addSource(db2Source)
                .print()
                .setParallelism(1); // use parallelism 1 for sink to keep message ordering

        env.execute("Print Db2 Snapshot + Change Stream");
    }
}
The Db2 CDC incremental connector (after 3.1.0) can be used as follows:
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.cdc.connectors.base.options.StartupOptions;
import org.apache.flink.cdc.connectors.db2.source.Db2SourceBuilder;
import org.apache.flink.cdc.connectors.db2.source.Db2SourceBuilder.Db2IncrementalSource;
import org.apache.flink.cdc.debezium.JsonDebeziumDeserializationSchema;

public class Db2ParallelSourceExample {
    public static void main(String[] args) throws Exception {
        Db2IncrementalSource<String> db2Source =
                new Db2SourceBuilder<String>()
                        .hostname("localhost")
                        .port(50000)
                        .databaseList("TESTDB") // set captured database
                        .tableList("DB2INST1.CUSTOMERS") // set captured table (schema.table)
                        .username("flink")
                        .password("flinkpw")
                        .deserializer(new JsonDebeziumDeserializationSchema())
                        .startupOptions(StartupOptions.initial())
                        .build();

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // enable checkpoint
        env.enableCheckpointing(3000);

        // set the source parallelism to 2
        env.fromSource(db2Source, WatermarkStrategy.noWatermarks(), "Db2IncrementalSource")
                .setParallelism(2)
                .print()
                .setParallelism(1); // use parallelism 1 for sink to keep message ordering

        env.execute("Print Db2 Snapshot + Change Stream");
    }
}
Data Type Mapping #
Db2 type | Flink SQL type | NOTE |
---|---|---|
SMALLINT | SMALLINT | |
INTEGER | INT | |
BIGINT | BIGINT | |
REAL | FLOAT | |
DOUBLE | DOUBLE | |
NUMERIC(p, s), DECIMAL(p, s) | DECIMAL(p, s) | |
DATE | DATE | |
TIME | TIME | |
TIMESTAMP [(p)] | TIMESTAMP [(p)] | |
CHARACTER(n) | CHAR(n) | |
VARCHAR(n) | VARCHAR(n) | |
BINARY(n) | BINARY(n) | |
VARBINARY(N) | VARBINARY(N) | |
BLOB, CLOB, DBCLOB | BYTES | |
VARGRAPHIC, XML | STRING | |