@Experimental
public class FlinkKafkaShuffle extends Object

FlinkKafkaShuffle uses Kafka as a message bus to shuffle and persist data at the same time.

Persisting shuffle data is useful when

- you would like to reuse the shuffle data, and/or
- you would like to avoid a full restart of a pipeline during failure recovery.

Persisting a shuffle is achieved by wrapping a FlinkKafkaShuffleProducer and a FlinkKafkaShuffleConsumer together into a FlinkKafkaShuffle. Here is an example of how to use a FlinkKafkaShuffle:
    StreamExecutionEnvironment env = ...       // create execution environment
    DataStream<X> source = env.addSource(...)  // add data stream source
    DataStream<Y> dataStream = ...             // some transformation(s) based on source

    KeyedStream<Y, KEY> keyedStream = FlinkKafkaShuffle
        .persistentKeyBy(            // keyBy shuffle through Kafka
            dataStream,              // data stream to be shuffled
            topic,                   // Kafka topic written to
            producerParallelism,     // the number of tasks of the Kafka producer
            numberOfPartitions,      // the number of partitions of the Kafka topic written to
            kafkaProperties,         // Kafka properties for the Kafka producer and consumer
            keySelector<Y, KEY>);    // key selector to retrieve the key from `dataStream`

    keyedStream.transform...         // some other transformation(s)

    KeyedStream<Y, KEY> keyedStreamReuse = FlinkKafkaShuffle
        .readKeyBy(                  // read the Kafka shuffle data again for other usages
            topic,                   // the Kafka topic where data is persisted
            env,                     // execution environment; it can be a new environment
            typeInformation<Y>,      // type information of the data persisted in Kafka
            kafkaProperties,         // Kafka properties for the Kafka consumer
            keySelector<Y, KEY>);    // key selector to retrieve the key

    keyedStreamReuse.transform...    // some other transformation(s)
Usage of persistentKeyBy(DataStream<T>, String, int, int, Properties, KeySelector<T, K>) is similar to DataStream.keyBy(KeySelector). The differences are:

1). Partitioning is done through FlinkKafkaShuffleProducer. FlinkKafkaShuffleProducer decides which partition a key goes to when writing to Kafka.

2). Shuffle data can be reused through readKeyBy(String, StreamExecutionEnvironment, TypeInformation<T>, Properties, KeySelector<T, K>), as shown in the example above.

3). Job execution is decoupled by the persistent Kafka message bus. In the example, the job execution graph is decoupled into three regions: `KafkaShuffleProducer`, `KafkaShuffleConsumer` and `KafkaShuffleConsumerReuse` through the `PERSISTENT DATA` shown below. If any one region fails, the other two keep progressing.

    source -> ... KafkaShuffleProducer -> PERSISTENT DATA -> KafkaShuffleConsumer -> ...
                                               |
                                               | ----------> KafkaShuffleConsumerReuse -> ...
| Constructor and Description |
| --- |
| `FlinkKafkaShuffle()` |
| Modifier and Type | Method and Description |
| --- | --- |
| `static <T> KeyedStream<T,Tuple>` | `persistentKeyBy(DataStream<T> dataStream, String topic, int producerParallelism, int numberOfPartitions, Properties properties, int... fields)` Uses Kafka as a message bus to persist keyBy shuffle. |
| `static <T,K> KeyedStream<T,K>` | `persistentKeyBy(DataStream<T> dataStream, String topic, int producerParallelism, int numberOfPartitions, Properties properties, KeySelector<T,K> keySelector)` Uses Kafka as a message bus to persist keyBy shuffle. |
| `static <T,K> KeyedStream<T,K>` | `readKeyBy(String topic, StreamExecutionEnvironment env, TypeInformation<T> typeInformation, Properties kafkaProperties, KeySelector<T,K> keySelector)` The read side of persistentKeyBy. |
| `static <T> void` | `writeKeyBy(DataStream<T> dataStream, String topic, Properties kafkaProperties, int... fields)` The write side of persistentKeyBy. |
| `static <T,K> void` | `writeKeyBy(DataStream<T> dataStream, String topic, Properties kafkaProperties, KeySelector<T,K> keySelector)` The write side of persistentKeyBy. |
public static <T,K> KeyedStream<T,K> persistentKeyBy(DataStream<T> dataStream, String topic, int producerParallelism, int numberOfPartitions, Properties properties, KeySelector<T,K> keySelector)
Uses Kafka as a message bus to persist keyBy shuffle. Persisting keyBy shuffle is achieved by wrapping a FlinkKafkaShuffleProducer and a FlinkKafkaShuffleConsumer together.

On the producer side, FlinkKafkaShuffleProducer is similar to DataStream.keyBy(KeySelector). They use the same key group assignment function KeyGroupRangeAssignment.assignKeyToParallelOperator(Object, int, int) to decide which partition a key goes to. Hence, each producer task can potentially write to each Kafka partition based on where the key goes. Here, `numberOfPartitions` equals the key group size. In the case of using TimeCharacteristic.EventTime, each producer task broadcasts its watermark to ALL of the Kafka partitions to make sure watermark information is propagated correctly. A small sketch of this partition assignment follows the parameter list below.

On the consumer side, each consumer task reads the partitions corresponding to the key group indices it is assigned. `numberOfPartitions` is the maximum parallelism of the consumer. This version only supports numberOfPartitions = consumerParallelism. In the case of using TimeCharacteristic.EventTime, a consumer task is responsible for emitting watermarks. Watermarks are read from the corresponding Kafka partitions. Notice that a consumer task only starts to emit a watermark after reading at least one watermark from each producer task, to make sure watermarks are monotonically increasing. Hence, a consumer task needs to know `producerParallelism` as well.
Type Parameters:
T - Type of the input data stream
K - Type of key

Parameters:
dataStream - Data stream to be shuffled
topic - Kafka topic written to
producerParallelism - Parallelism of the producer
numberOfPartitions - Number of partitions
properties - Kafka properties
keySelector - Key selector to retrieve the key from `dataStream`

See Also:
writeKeyBy(DataStream<T>, String, Properties, KeySelector<T, K>), readKeyBy(String, StreamExecutionEnvironment, TypeInformation<T>, Properties, KeySelector<T, K>)
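For intuition, the following minimal sketch reproduces the partition assignment described above by calling the KeyGroupRangeAssignment utility directly. The key values and partition count are illustrative, and treating `numberOfPartitions` as both the maximum parallelism and the parallelism mirrors the description; this is a sketch, not the producer's internal code.

    import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

    public class PartitionAssignmentSketch {

        public static void main(String[] args) {
            // `numberOfPartitions` plays the role of the key group size (max parallelism).
            int numberOfPartitions = 8;

            for (String key : new String[] {"user-1", "user-2", "user-3"}) {
                // key -> key group -> operator index; with parallelism equal to
                // maxParallelism, each operator index maps to one Kafka partition.
                int partition = KeyGroupRangeAssignment.assignKeyToParallelOperator(
                        key, numberOfPartitions, numberOfPartitions);
                System.out.println(key + " -> partition " + partition);
            }
        }
    }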
public static <T> KeyedStream<T,Tuple> persistentKeyBy(DataStream<T> dataStream, String topic, int producerParallelism, int numberOfPartitions, Properties properties, int... fields)
Uses Kafka as a message bus to persist keyBy shuffle. Persisting keyBy shuffle is achieved by wrapping a FlinkKafkaShuffleProducer and a FlinkKafkaShuffleConsumer together.

On the producer side, FlinkKafkaShuffleProducer is similar to DataStream.keyBy(KeySelector). They use the same key group assignment function KeyGroupRangeAssignment.assignKeyToParallelOperator(Object, int, int) to decide which partition a key goes to. Hence, each producer task can potentially write to each Kafka partition based on where the key goes. Here, `numberOfPartitions` equals the key group size. In the case of using TimeCharacteristic.EventTime, each producer task broadcasts its watermark to ALL of the Kafka partitions to make sure watermark information is propagated correctly.

On the consumer side, each consumer task reads the partitions corresponding to the key group indices it is assigned. `numberOfPartitions` is the maximum parallelism of the consumer. This version only supports numberOfPartitions = consumerParallelism. In the case of using TimeCharacteristic.EventTime, a consumer task is responsible for emitting watermarks. Watermarks are read from the corresponding Kafka partitions. Notice that a consumer task only starts to emit a watermark after reading at least one watermark from each producer task, to make sure watermarks are monotonically increasing. Hence, a consumer task needs to know `producerParallelism` as well. A usage sketch of this overload follows the parameter list below.
Type Parameters:
T - Type of the input data stream

Parameters:
dataStream - Data stream to be shuffled
topic - Kafka topic written to
producerParallelism - Parallelism of the producer
numberOfPartitions - Number of partitions
properties - Kafka properties
fields - Key positions from the input data stream

See Also:
writeKeyBy(DataStream<T>, String, Properties, KeySelector<T, K>), readKeyBy(String, StreamExecutionEnvironment, TypeInformation<T>, Properties, KeySelector<T, K>)
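As a usage sketch of this overload, assuming a Tuple2 stream keyed by its first field; the topic name, broker address, and parallelism values are illustrative:

    import java.util.Properties;

    import org.apache.flink.api.java.tuple.Tuple;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.KeyedStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.shuffle.FlinkKafkaShuffle;

    public class PersistentKeyByFieldsSketch {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<Tuple2<String, Long>> counts =
                    env.fromElements(Tuple2.of("a", 1L), Tuple2.of("b", 2L));

            Properties kafkaProperties = new Properties();
            kafkaProperties.setProperty("bootstrap.servers", "localhost:9092"); // illustrative

            // Key by tuple field 0; records are shuffled and persisted through Kafka.
            KeyedStream<Tuple2<String, Long>, Tuple> keyed = FlinkKafkaShuffle.persistentKeyBy(
                    counts,
                    "shuffle-topic", // illustrative topic name
                    2,               // producerParallelism
                    4,               // numberOfPartitions (key group size)
                    kafkaProperties,
                    0);              // key position in the tuple

            keyed.sum(1).print();
            env.execute("persistent-key-by-sketch");
        }
    }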
public static <T,K> void writeKeyBy(DataStream<T> dataStream, String topic, Properties kafkaProperties, KeySelector<T,K> keySelector)
The write side of persistentKeyBy(DataStream<T>, String, int, int, Properties, KeySelector<T, K>).

This function contains a FlinkKafkaShuffleProducer to shuffle and persist data in Kafka. FlinkKafkaShuffleProducer uses the same key group assignment function KeyGroupRangeAssignment.assignKeyToParallelOperator(Object, int, int) to decide which partition a key goes to. Hence, each producer task can potentially write to each Kafka partition based on the key. Here, the number of partitions equals the key group size. In the case of using TimeCharacteristic.EventTime, each producer task broadcasts each watermark to all of the Kafka partitions to make sure watermark information is propagated properly. A usage sketch follows the parameter list below.

Attention: make sure kafkaProperties includes PRODUCER_PARALLELISM and PARTITION_NUMBER explicitly. PRODUCER_PARALLELISM is the parallelism of the producer. PARTITION_NUMBER is the number of partitions. They are not necessarily the same and are allowed to be set independently.
Type Parameters:
T - Type of the input data stream
K - Type of key

Parameters:
dataStream - Data stream to be shuffled
topic - Kafka topic written to
kafkaProperties - Kafka properties for the Kafka producer
keySelector - Key selector to retrieve the key from `dataStream`

See Also:
persistentKeyBy(DataStream<T>, String, int, int, Properties, KeySelector<T, K>), readKeyBy(String, StreamExecutionEnvironment, TypeInformation<T>, Properties, KeySelector<T, K>)
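A minimal write-side sketch, assuming the PRODUCER_PARALLELISM and PARTITION_NUMBER property keys are accessible as constants on FlinkKafkaShuffle as shown; the broker address, topic name, and values are illustrative:

    import java.util.Properties;

    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.shuffle.FlinkKafkaShuffle;

    public class WriteKeyBySketch {

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            DataStream<Tuple2<String, Long>> records =
                    env.fromElements(Tuple2.of("a", 1L), Tuple2.of("b", 2L));

            Properties kafkaProperties = new Properties();
            kafkaProperties.setProperty("bootstrap.servers", "localhost:9092"); // illustrative
            // Both settings must be present explicitly, as noted above.
            kafkaProperties.setProperty(FlinkKafkaShuffle.PRODUCER_PARALLELISM, "2");
            kafkaProperties.setProperty(FlinkKafkaShuffle.PARTITION_NUMBER, "4");

            FlinkKafkaShuffle.writeKeyBy(
                    records,
                    "shuffle-topic", // illustrative topic name
                    kafkaProperties,
                    new KeySelector<Tuple2<String, Long>, String>() {
                        @Override
                        public String getKey(Tuple2<String, Long> value) {
                            return value.f0; // key on the first tuple field
                        }
                    });

            env.execute("write-key-by-sketch");
        }
    }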
public static <T> void writeKeyBy(DataStream<T> dataStream, String topic, Properties kafkaProperties, int... fields)
The write side of persistentKeyBy(DataStream<T>, String, int, int, Properties, KeySelector<T, K>).

This function contains a FlinkKafkaShuffleProducer to shuffle and persist data in Kafka. FlinkKafkaShuffleProducer uses the same key group assignment function KeyGroupRangeAssignment.assignKeyToParallelOperator(Object, int, int) to decide which partition a key goes to. Hence, each producer task can potentially write to each Kafka partition based on the key. Here, the number of partitions equals the key group size. In the case of using TimeCharacteristic.EventTime, each producer task broadcasts each watermark to all of the Kafka partitions to make sure watermark information is propagated properly.

Attention: make sure kafkaProperties includes PRODUCER_PARALLELISM and PARTITION_NUMBER explicitly. PRODUCER_PARALLELISM is the parallelism of the producer. PARTITION_NUMBER is the number of partitions. They are not necessarily the same and are allowed to be set independently.
Type Parameters:
T - Type of the input data stream

Parameters:
dataStream - Data stream to be shuffled
topic - Kafka topic written to
kafkaProperties - Kafka properties for the Kafka producer
fields - Key positions from the input data stream

See Also:
persistentKeyBy(DataStream<T>, String, int, int, Properties, KeySelector<T, K>), readKeyBy(String, StreamExecutionEnvironment, TypeInformation<T>, Properties, KeySelector<T, K>)
public static <T,K> KeyedStream<T,K> readKeyBy(String topic, StreamExecutionEnvironment env, TypeInformation<T> typeInformation, Properties kafkaProperties, KeySelector<T,K> keySelector)
The read side of persistentKeyBy(DataStream<T>, String, int, int, Properties, KeySelector<T, K>).

Each consumer task reads the Kafka partitions corresponding to the key group indices it is assigned. The number of Kafka partitions is the maximum parallelism of the consumer. This version only supports numberOfPartitions = consumerParallelism. In the case of using TimeCharacteristic.EventTime, a consumer task is responsible for emitting watermarks. Watermarks are read from the corresponding Kafka partitions. Notice that a consumer task only starts to emit a watermark after receiving at least one watermark from each producer task, to make sure watermarks are monotonically increasing. Hence, a consumer task needs to know `producerParallelism` as well. A usage sketch follows the parameter list below.

Attention: make sure kafkaProperties includes PRODUCER_PARALLELISM and PARTITION_NUMBER explicitly. PRODUCER_PARALLELISM is the parallelism of the producer. PARTITION_NUMBER is the number of partitions. They are not necessarily the same and are allowed to be set independently.
Type Parameters:
T - Schema type
K - Key type

Parameters:
topic - The Kafka topic where data is persisted
env - Execution environment; readKeyBy's environment can be different from writeKeyBy's
typeInformation - Type information of the data persisted in Kafka
kafkaProperties - Kafka properties for the Kafka consumer
keySelector - Key selector to retrieve the key

See Also:
persistentKeyBy(DataStream<T>, String, int, int, Properties, KeySelector<T, K>), writeKeyBy(DataStream<T>, String, Properties, KeySelector<T, K>)
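A matching read-side sketch; the environment may belong to a different job from the write side, and the broker address, group id, topic name, and settings are illustrative (the PRODUCER_PARALLELISM and PARTITION_NUMBER constants are assumed accessible as shown):

    import java.util.Properties;

    import org.apache.flink.api.common.typeinfo.TypeHint;
    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.KeyedStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.shuffle.FlinkKafkaShuffle;

    public class ReadKeyBySketch {

        public static void main(String[] args) throws Exception {
            // May be a different environment (even a different job) from the write side.
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties kafkaProperties = new Properties();
            kafkaProperties.setProperty("bootstrap.servers", "localhost:9092"); // illustrative
            kafkaProperties.setProperty("group.id", "shuffle-read");            // consumer group id
            // The read side also needs both settings explicitly, as noted above.
            kafkaProperties.setProperty(FlinkKafkaShuffle.PRODUCER_PARALLELISM, "2");
            kafkaProperties.setProperty(FlinkKafkaShuffle.PARTITION_NUMBER, "4");

            KeyedStream<Tuple2<String, Long>, String> keyed = FlinkKafkaShuffle.readKeyBy(
                    "shuffle-topic", // same topic the data was written to
                    env,
                    TypeInformation.of(new TypeHint<Tuple2<String, Long>>() {}),
                    kafkaProperties,
                    new KeySelector<Tuple2<String, Long>, String>() {
                        @Override
                        public String getKey(Tuple2<String, Long> value) {
                            return value.f0; // same key as the write side
                        }
                    });

            keyed.sum(1).print();
            env.execute("read-key-by-sketch");
        }
    }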