DataStream Connectors #
Predefined Sources and Sinks #
A few basic data sources and sinks are built into Flink and are always available. The predefined data sources include reading from files, directories, and sockets, and ingesting data from collections and iterators. The predefined data sinks support writing to files, to stdout and stderr, and to sockets.
Bundled Connectors #
Connectors provide code for interfacing with various third-party systems. Currently these systems are supported:
- Apache Kafka (source/sink)
- Apache Cassandra (sink)
- Amazon DynamoDB (sink)
- Amazon Kinesis Data Streams (source/sink)
- Amazon Kinesis Data Firehose (sink)
- DataGen (source)
- Elasticsearch (sink)
- Opensearch (sink)
- FileSystem (source/sink)
- RabbitMQ (source/sink)
- Google PubSub (source/sink)
- Hybrid Source (source)
- Apache Pulsar (source)
- JDBC (sink)
- MongoDB (source/sink)
Keep in mind that to use one of these connectors in an application, additional third party components are usually required, e.g. servers for the data stores or message queues. Note also that while the streaming connectors listed in this section are part of the Flink project and are included in source releases, they are not included in the binary distributions. Further instructions can be found in the corresponding subsections.
Filesystem source formats are gradually replaced with new Flink Source API starting with Flink 1.14.0.
Connectors in Apache Bahir #
Additional streaming connectors for Flink are being released through Apache Bahir, including:
- Apache ActiveMQ (source/sink)
- Apache Flume (sink)
- Redis (sink)
- Akka (sink)
- Netty (source)
Other Ways to Connect to Flink #
Data Enrichment via Async I/O #
Using a connector isn’t the only way to get data in and out of Flink.
One common pattern is to query an external database or web service in a Map
or FlatMap
in order to enrich the primary datastream.
Flink offers an API for Asynchronous I/O
to make it easier to do this kind of enrichment efficiently and robustly.
Queryable State #
When a Flink application pushes a lot of data to an external data store, this can become an I/O bottleneck. If the data involved has many fewer reads than writes, a better approach can be for an external application to pull from Flink the data it needs. The Queryable State interface enables this by allowing the state being managed by Flink to be queried on demand.