Stateful Functions offers an Apache Kafka I/O Module for reading from and writing to Kafka topics. It is based on Apache Flink’s universal Kafka connector and provides exactly-once processing semantics. The Kafka I/O Module is configurable in Yaml or Java.
To use the Kafka I/O Module in Java, please include the following dependency in your pom.
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>statefun-kafka-io</artifactId>
<version>2.2.2</version>
<scope>provided</scope>
</dependency>
A KafkaIngressSpec
declares an ingress spec for consuming from Kafka cluster.
It accepts the following arguments:
KafkaIngressDeserializer
for deserializing data from Kafka (Java only)version: "1.0"
module:
meta:
type: remote
spec:
ingresses:
- ingress:
meta:
type: statefun.kafka.io/routable-protobuf-ingress
id: example/user-ingress
spec:
address: kafka-broker:9092
consumerGroupId: routable-kafka-e2e
startupPosition:
type: earliest
topics:
- topic: messages-1
typeUrl: org.apache.flink.statefun.docs.models.User
targets:
- example-namespace/my-function-1
- example-namespace/my-function-2
package org.apache.flink.statefun.docs.io.kafka;
import org.apache.flink.statefun.docs.models.User;
import org.apache.flink.statefun.sdk.io.IngressIdentifier;
import org.apache.flink.statefun.sdk.io.IngressSpec;
import org.apache.flink.statefun.sdk.kafka.KafkaIngressBuilder;
import org.apache.flink.statefun.sdk.kafka.KafkaIngressStartupPosition;
public class IngressSpecs {
public static final IngressIdentifier<User> ID =
new IngressIdentifier<>(User.class, "example", "input-ingress");
public static final IngressSpec<User> kafkaIngress =
KafkaIngressBuilder.forIdentifier(ID)
.withKafkaAddress("localhost:9092")
.withConsumerGroupId("greetings")
.withTopic("my-topic")
.withDeserializer(UserDeserializer.class)
.withStartupPosition(KafkaIngressStartupPosition.fromLatest())
.build();
}
The ingress also accepts properties to directly configure the Kafka client, using KafkaIngressBuilder#withProperties(Properties)
.
Please refer to the Kafka consumer configuration documentation for the full list of available properties.
Note that configuration passed using named methods, such as KafkaIngressBuilder#withConsumerGroupId(String)
, will have higher precedence and overwrite their respective settings in the provided properties.
The ingress allows configuring the startup position to be one of the following:
Starts from offsets that were committed to Kafka for the specified consumer group.
startupPosition:
type: group-offsets
KafkaIngressStartupPosition#fromGroupOffsets();
Starts from the earliest offset.
startupPosition:
type: earliest
KafkaIngressStartupPosition#fromEarliest();
Starts from the latest offset.
startupPosition:
type: latest
KafkaIngressStartupPosition#fromLatest();
Starts from specific offsets, defined as a map of partitions to their target starting offset.
startupPosition:
type: specific-offsets
offsets:
- user-topic/0: 91
- user-topic/1: 11
- user-topic/2: 8
Map<TopicPartition, Long> offsets = new HashMap<>();
offsets.add(new TopicPartition("user-topic", 0), 91);
offsets.add(new TopicPartition("user-topic", 11), 11);
offsets.add(new TopicPartition("user-topic", 8), 8);
KafkaIngressStartupPosition#fromSpecificOffsets(offsets);
Starts from offsets that have an ingestion time larger than or equal to a specified date.
startupPosition:
type: date
date: 2020-02-01 04:15:00.00 Z
KafkaIngressStartupPosition#fromDate(ZonedDateTime.now());
On startup, if the specified startup offset for a partition is out-of-range or does not exist (which may be the case if the ingress is configured to start from group offsets, specific offsets, or from a date), then the ingress will fallback to using the position configured using KafkaIngressBuilder#withAutoOffsetResetPosition(KafkaIngressAutoResetPosition)
.
By default, this is set to be the latest position.
When using the Java api, the Kafka ingress needs to know how to turn the binary data in Kafka into Java objects.
The KafkaIngressDeserializer
allows users to specify such a schema.
The T deserialize(ConsumerRecord<byte[], byte[]> record)
method gets called for each Kafka message, passing the key, value, and metadata from Kafka.
package org.apache.flink.statefun.docs.io.kafka;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import org.apache.flink.statefun.docs.models.User;
import org.apache.flink.statefun.sdk.kafka.KafkaIngressDeserializer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class UserDeserializer implements KafkaIngressDeserializer<User> {
private static Logger LOG = LoggerFactory.getLogger(UserDeserializer.class);
private final ObjectMapper mapper = new ObjectMapper();
@Override
public User deserialize(ConsumerRecord<byte[], byte[]> input) {
try {
return mapper.readValue(input.value(), User.class);
} catch (IOException e) {
LOG.debug("Failed to deserialize record", e);
return null;
}
}
}
A KafkaEgressBuilder
declares an egress spec for writing data out to a Kafka cluster.
It accepts the following arguments:
KafkaEgressSerializer
for serializing data into Kafka (Java only)version: "1.0"
module:
meta:
type: remote
spec:
egresses:
- egress:
meta:
type: statefun.kafka.io/generic-egress
id: example/output-messages
spec:
address: kafka-broker:9092
deliverySemantic:
type: exactly-once
transactionTimeoutMillis: 100000
properties:
- foo.config: bar
package org.apache.flink.statefun.docs.io.kafka;
import org.apache.flink.statefun.docs.models.User;
import org.apache.flink.statefun.sdk.io.EgressIdentifier;
import org.apache.flink.statefun.sdk.io.EgressSpec;
import org.apache.flink.statefun.sdk.kafka.KafkaEgressBuilder;
public class EgressSpecs {
public static final EgressIdentifier<User> ID =
new EgressIdentifier<>("example", "output-egress", User.class);
public static final EgressSpec<User> kafkaEgress =
KafkaEgressBuilder.forIdentifier(ID)
.withKafkaAddress("localhost:9092")
.withSerializer(UserSerializer.class)
.build();
}
Please refer to the Kafka producer configuration documentation for the full list of available properties.
With fault tolerance enabled, the Kafka egress can provide exactly-once delivery guarantees. You can choose three different modes of operation.
Nothing is guaranteed, produced records can be lost or duplicated.
deliverySemantic:
type: none
KafkaEgressBuilder#withNoProducerSemantics();
Stateful Functions will guarantee that no records will be lost but they can be duplicated.
deliverySemantic:
type: at-least-once
KafkaEgressBuilder#withAtLeastOnceProducerSemantics();
Stateful Functions uses Kafka transactions to provide exactly-once semantics.
deliverySemantic:
type: exactly-once
transactionTimeoutMillis: 900000 # 15 min
KafkaEgressBuilder#withExactlyOnceProducerSemantics(Duration.minutes(15));
When using the Java api, the Kafka egress needs to know how to turn Java objects into binary data.
The KafkaEgressSerializer
allows users to specify such a schema.
The ProducerRecord<byte[], byte[]> serialize(T out)
method gets called for each message, allowing users to set a key, value, and other metadata.
package org.apache.flink.statefun.docs.io.kafka;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.statefun.docs.models.User;
import org.apache.flink.statefun.sdk.kafka.KafkaEgressSerializer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class UserSerializer implements KafkaEgressSerializer<User> {
private static final Logger LOG = LoggerFactory.getLogger(UserSerializer.class);
private static final String TOPIC = "user-topic";
private final ObjectMapper mapper = new ObjectMapper();
@Override
public ProducerRecord<byte[], byte[]> serialize(User user) {
try {
byte[] key = user.getUserId().getBytes();
byte[] value = mapper.writeValueAsBytes(user);
return new ProducerRecord<>(TOPIC, key, value);
} catch (JsonProcessingException e) {
LOG.info("Failed to serializer user", e);
return null;
}
}
}