Apache Kafka

Stateful Functions offers an Apache Kafka I/O Module for reading from and writing to Kafka topics. It is based on Apache Flink’s universal Kafka connector and provides exactly-once processing semantics. The Kafka I/O Module is configurable in YAML or Java.

Dependency

To use the Kafka I/O Module in Java, include the following dependency in your pom.xml.

<dependency>
	<groupId>org.apache.flink</groupId>
	<artifactId>statefun-kafka-io</artifactId>
	<version>2.1.0</version>
	<scope>provided</scope>
</dependency>

Kafka Ingress Spec

A KafkaIngressSpec declares an ingress spec for consuming from a Kafka cluster.

It accepts the following arguments:

  1. The ingress identifier associated with this ingress
  2. The topic name / list of topic names
  3. The address of the bootstrap servers
  4. The consumer group id to use
  5. A KafkaIngressDeserializer for deserializing data from Kafka (Java only)
  6. The position to start consuming from

package org.apache.flink.statefun.docs.io.kafka;

import org.apache.flink.statefun.docs.models.User;
import org.apache.flink.statefun.sdk.io.IngressIdentifier;
import org.apache.flink.statefun.sdk.io.IngressSpec;
import org.apache.flink.statefun.sdk.kafka.KafkaIngressBuilder;
import org.apache.flink.statefun.sdk.kafka.KafkaIngressStartupPosition;

public class IngressSpecs {

  public static final IngressIdentifier<User> ID =
      new IngressIdentifier<>(User.class, "example", "input-ingress");

  public static final IngressSpec<User> kafkaIngress =
      KafkaIngressBuilder.forIdentifier(ID)
          .withKafkaAddress("localhost:9092")
          .withConsumerGroupId("greetings")
          .withTopic("my-topic")
          .withDeserializer(UserDeserializer.class)
          .withStartupPosition(KafkaIngressStartupPosition.fromLatest())
          .build();
}

version: "1.0"

module:
  meta:
    type: remote
  spec:
    ingresses:
      - ingress:
          meta:
            type: statefun.kafka.io/routable-protobuf-ingress
            id: example/user-ingress
          spec:
            address: kafka-broker:9092
            consumerGroupId: routable-kafka-e2e
            startupPosition:
              type: earliest
            topics:
              - topic: messages-1
                typeUrl: org.apache.flink.statefun.docs.models.User
                targets:
                  - example-namespace/my-function-1
                  - example-namespace/my-function-2

The ingress also accepts properties to directly configure the Kafka client, using KafkaIngressBuilder#withProperties(Properties). Please refer to the Kafka consumer configuration documentation for the full list of available properties. Note that configuration passed using named methods, such as KafkaIngressBuilder#withConsumerGroupId(String), will have higher precedence and overwrite their respective settings in the provided properties.
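
The precedence rule can be pictured with plain java.util.Properties. The following is only a sketch of the described behavior, not the builder's actual implementation. The class and method names (ConsumerPropertiesSketch, resolve) are hypothetical, while bootstrap.servers, group.id, and max.poll.records are standard Kafka consumer configuration keys.

```java
import java.util.Properties;

/** Hypothetical illustration: settings from named methods win over supplied properties. */
final class ConsumerPropertiesSketch {

  /** Copies the user-supplied properties, then applies the named settings on top. */
  static Properties resolve(
      Properties userProperties, String kafkaAddress, String consumerGroupId) {
    Properties resolved = new Properties();
    resolved.putAll(userProperties);
    // Named settings are written last, so they overwrite the same keys from userProperties.
    resolved.setProperty("bootstrap.servers", kafkaAddress);
    resolved.setProperty("group.id", consumerGroupId);
    return resolved;
  }

  public static void main(String[] args) {
    Properties user = new Properties();
    user.setProperty("group.id", "from-properties");
    user.setProperty("max.poll.records", "100");

    Properties resolved = resolve(user, "localhost:9092", "greetings");
    System.out.println(resolved.getProperty("group.id")); // greetings
    System.out.println(resolved.getProperty("max.poll.records")); // 100
  }
}
```

Keys that have no named method, such as max.poll.records here, pass through unchanged.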

Startup Position

The ingress allows configuring the startup position to be one of the following:

From Group Offset (default)

Starts from offsets that were committed to Kafka for the specified consumer group.

KafkaIngressStartupPosition#fromGroupOffsets();

startupPosition:
    type: group-offsets

Earliest

Starts from the earliest offset.

KafkaIngressStartupPosition#fromEarliest();

startupPosition:
    type: earliest

Latest

Starts from the latest offset.

KafkaIngressStartupPosition#fromLatest();

startupPosition:
    type: latest

Specific Offsets

Starts from specific offsets, defined as a map of partitions to their target starting offset.

// Map each partition of "user-topic" to the offset to start reading from.
Map<TopicPartition, Long> offsets = new HashMap<>();
offsets.put(new TopicPartition("user-topic", 0), 91L);
offsets.put(new TopicPartition("user-topic", 1), 11L);
offsets.put(new TopicPartition("user-topic", 2), 8L);

KafkaIngressStartupPosition#fromSpecificOffsets(offsets);

startupPosition:
    type: specific-offsets
    offsets:
        - user-topic/0: 91
        - user-topic/1: 11
        - user-topic/2: 8

Date

Starts from offsets that have an ingestion time larger than or equal to a specified date.

KafkaIngressStartupPosition#fromDate(ZonedDateTime.now());

startupPosition:
    type: date
    date: 2020-02-01 04:15:00.00 Z
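
For a reproducible startup position, a fixed ZonedDateTime can be used in place of ZonedDateTime.now(). The following minimal sketch uses only java.time to build the instant shown in the YAML example. The class name StartupDateSketch is hypothetical, and the call to fromDate is left as a comment because it needs the SDK on the classpath.

```java
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

/** Hypothetical helper that builds the fixed date from the YAML example. */
final class StartupDateSketch {

  /** Returns 2020-02-01 04:15:00 in UTC. */
  static ZonedDateTime startupDate() {
    return ZonedDateTime.of(2020, 2, 1, 4, 15, 0, 0, ZoneOffset.UTC);
  }

  public static void main(String[] args) {
    ZonedDateTime date = startupDate();
    // With the SDK available, this value would be passed on:
    // KafkaIngressStartupPosition.fromDate(date);
    System.out.println(date.toInstant().toEpochMilli()); // 1580530500000
  }
}
```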

On startup, if the specified startup offset for a partition is out of range or does not exist (which may be the case if the ingress is configured to start from group offsets, specific offsets, or a date), the ingress falls back to the position configured with KafkaIngressBuilder#withAutoOffsetResetPosition(KafkaIngressAutoResetPosition). By default, this is the latest position.
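
The fallback rule can be read as a small decision function. The sketch below only models that decision in plain Java; in a running ingress the Kafka consumer resolves offsets itself. StartupOffsetSketch, its AutoResetPosition enum, and the parameters of resolve are hypothetical names for illustration and are not part of the Stateful Functions SDK.

```java
import java.util.OptionalLong;

/** Hypothetical model of the startup offset fallback for a single partition. */
final class StartupOffsetSketch {

  /** Stand-in for the two possible auto offset reset positions. */
  enum AutoResetPosition {
    EARLIEST,
    LATEST
  }

  /**
   * Returns the offset to start reading from.
   *
   * @param requested offset derived from the startup position, empty if none exists
   * @param earliest first offset still available in the partition
   * @param latest end offset of the partition
   * @param reset position to fall back to
   */
  static long resolve(
      OptionalLong requested, long earliest, long latest, AutoResetPosition reset) {
    if (requested.isPresent()) {
      long offset = requested.getAsLong();
      // An offset inside the available range is used as requested.
      if (offset >= earliest && offset <= latest) {
        return offset;
      }
    }
    // Missing or out-of-range offsets fall back to the reset position.
    return reset == AutoResetPosition.EARLIEST ? earliest : latest;
  }

  public static void main(String[] args) {
    System.out.println(resolve(OptionalLong.of(91), 0, 100, AutoResetPosition.LATEST)); // 91
    System.out.println(resolve(OptionalLong.of(500), 0, 100, AutoResetPosition.LATEST)); // 100
    System.out.println(resolve(OptionalLong.empty(), 10, 100, AutoResetPosition.EARLIEST)); // 10
  }
}
```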

Kafka Deserializer

When using the Java API, the Kafka ingress needs to know how to turn the binary data in Kafka into Java objects. The KafkaIngressDeserializer allows users to specify such a schema. The T deserialize(ConsumerRecord<byte[], byte[]> record) method is called for each Kafka message, passing the key, value, and metadata from Kafka.

package org.apache.flink.statefun.docs.io.kafka;

import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import org.apache.flink.statefun.docs.models.User;
import org.apache.flink.statefun.sdk.kafka.KafkaIngressDeserializer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class UserDeserializer implements KafkaIngressDeserializer<User> {

	private static final Logger LOG = LoggerFactory.getLogger(UserDeserializer.class);

	private final ObjectMapper mapper = new ObjectMapper();

	@Override
	public User deserialize(ConsumerRecord<byte[], byte[]> input) {
		try {
			return mapper.readValue(input.value(), User.class);
		} catch (IOException e) {
			LOG.debug("Failed to deserialize record", e);
			return null;
		}
	}
}

Kafka Egress Spec

A KafkaEgressBuilder declares an egress spec for writing data out to a Kafka cluster.

It accepts the following arguments:

  1. The egress identifier associated with this egress
  2. The address of the bootstrap servers
  3. A KafkaEgressSerializer for serializing data into Kafka (Java only)
  4. The fault tolerance semantic
  5. Properties for the Kafka producer

package org.apache.flink.statefun.docs.io.kafka;

import org.apache.flink.statefun.docs.models.User;
import org.apache.flink.statefun.sdk.io.EgressIdentifier;
import org.apache.flink.statefun.sdk.io.EgressSpec;
import org.apache.flink.statefun.sdk.kafka.KafkaEgressBuilder;

public class EgressSpecs {

  public static final EgressIdentifier<User> ID =
      new EgressIdentifier<>("example", "output-egress", User.class);

  public static final EgressSpec<User> kafkaEgress =
      KafkaEgressBuilder.forIdentifier(ID)
          .withKafkaAddress("localhost:9092")
          .withSerializer(UserSerializer.class)
          .build();
}

version: "1.0"

module:
  meta:
    type: remote
  spec:
    egresses:
      - egress:
          meta:
            type: statefun.kafka.io/generic-egress
            id: example/output-messages
          spec:
            address: kafka-broker:9092
            deliverySemantic:
              type: exactly-once
              transactionTimeoutMillis: 100000
            properties:
              - foo.config: bar

Please refer to the Kafka producer configuration documentation for the full list of available properties.

Kafka Egress and Fault Tolerance

With fault tolerance enabled, the Kafka egress can provide exactly-once delivery guarantees. You can choose among three modes of operation.

None

Nothing is guaranteed; produced records can be lost or duplicated.

KafkaEgressBuilder#withNoProducerSemantics();

deliverySemantic:
    type: none

At Least Once

Stateful Functions guarantees that no records are lost, but they can be duplicated.

KafkaEgressBuilder#withAtLeastOnceProducerSemantics();

deliverySemantic:
    type: at-least-once

Exactly Once

Stateful Functions uses Kafka transactions to provide exactly-once semantics.

KafkaEgressBuilder#withExactlyOnceProducerSemantics(Duration.ofMinutes(15));

deliverySemantic:
    type: exactly-once
    transactionTimeoutMillis: 900000 # 15 min
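
The YAML timeout is given in milliseconds, while the Java builder takes a duration. Assuming the builder accepts a java.time.Duration, the two values can be checked against each other with the standard library alone. TransactionTimeoutSketch is a hypothetical helper, not SDK code.

```java
import java.time.Duration;

/** Hypothetical helper that converts a transaction timeout to the YAML unit. */
final class TransactionTimeoutSketch {

  /** Returns the value to use for transactionTimeoutMillis. */
  static long toTransactionTimeoutMillis(Duration timeout) {
    return timeout.toMillis();
  }

  public static void main(String[] args) {
    System.out.println(toTransactionTimeoutMillis(Duration.ofMinutes(15))); // 900000
  }
}
```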

Kafka Serializer

When using the Java API, the Kafka egress needs to know how to turn Java objects into binary data. The KafkaEgressSerializer allows users to specify such a schema. The ProducerRecord<byte[], byte[]> serialize(T out) method is called for each message, allowing users to set a key, value, and other metadata.

package org.apache.flink.statefun.docs.io.kafka;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.statefun.docs.models.User;
import org.apache.flink.statefun.sdk.kafka.KafkaEgressSerializer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class UserSerializer implements KafkaEgressSerializer<User> {

  private static final Logger LOG = LoggerFactory.getLogger(UserSerializer.class);

  private static final String TOPIC = "user-topic";

  private final ObjectMapper mapper = new ObjectMapper();

  @Override
  public ProducerRecord<byte[], byte[]> serialize(User user) {
    try {
      byte[] key = user.getUserId().getBytes();
      byte[] value = mapper.writeValueAsBytes(user);

      return new ProducerRecord<>(TOPIC, key, value);
    } catch (JsonProcessingException e) {
      LOG.info("Failed to serialize user", e);
      return null;
    }
  }
}