How to Migrate from DataSet to DataStream #
The DataSet API has been formally deprecated and no longer receives active maintenance and support. It will be removed in Flink 2.0. Flink users are recommended to migrate to the DataStream API, or to the Table API and SQL, for their data processing requirements.
Note that the APIs in DataStream do not always match those in DataSet exactly. The purpose of this document is to help users understand how to achieve the same data processing behaviors with the DataStream API as with the DataSet API.
Based on the code changes and the impact on execution efficiency that migration requires, the DataSet APIs fall into four categories:
- Category 1: APIs that have an exact equivalent in DataStream; migration requires barely any changes.
- Category 2: APIs whose behavior can be achieved by other APIs with different semantics in DataStream; migration might require some code changes but results in the same execution efficiency.
- Category 3: APIs whose behavior can be achieved by other APIs with different semantics in DataStream, at a potentially additional cost in execution efficiency.
- Category 4: APIs whose behavior is not supported by the DataStream API.
The subsequent sections first show how to set up the execution environment and the sources/sinks, then explain in detail how to migrate each category of DataSet APIs to the DataStream API, highlighting the specific considerations and challenges of each category.
Setting the execution environment #
The first step of migrating an application from the DataSet API to the DataStream API is to replace `ExecutionEnvironment` with `StreamExecutionEnvironment`.
DataSet | DataStream |
---|---|
`ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();` | `StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();` |
Unlike DataSet, DataStream supports processing on both bounded and unbounded data streams. Therefore, users need to explicitly set the execution mode to `RuntimeExecutionMode.BATCH` if batch execution is expected.
StreamExecutionEnvironment executionEnvironment = // [...];
executionEnvironment.setRuntimeMode(RuntimeExecutionMode.BATCH);
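To see why this matters, consider a minimal sketch (assuming Flink is on the classpath; `fromElements` and `executeAndCollect` are used only to keep the example self-contained): in STREAMING mode a keyed aggregation emits an updated value for every input record, whereas in BATCH mode only the final value per key is emitted, matching DataSet semantics.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.ArrayList;
import java.util.List;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Without this line the job runs in STREAMING mode and emits an
// updated sum for every incoming record instead of one final result.
env.setRuntimeMode(RuntimeExecutionMode.BATCH);

List<Tuple2<String, Integer>> results = new ArrayList<>();
env.fromElements(Tuple2.of("a", 1), Tuple2.of("a", 2), Tuple2.of("b", 3))
        .keyBy(t -> t.f0)
        .sum(1)
        .executeAndCollect()
        .forEachRemaining(results::add);
// In BATCH mode the output is exactly (a,3) and (b,3).
```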
Using the streaming sources and sinks #
Sources #
The DataStream API uses `DataStreamSource` to read records from external systems, while the DataSet API uses `DataSource`.
DataSet | DataStream |
---|---|
`DataSource<> source = env.createInput(inputFormat)` | `DataStreamSource<> source = env.fromSource(source, watermarkStrategy, "source-name")` |
Sinks #
The DataStream API uses `DataStreamSink` to write records to external systems, while the DataSet API uses `DataSink`.
DataSet | DataStream |
---|---|
`DataSink<> sink = dataSet.output(outputFormat)` | `DataStreamSink<> sink = dataStream.sinkTo(sink)` |
If you are looking for pre-defined DataStream source and sink connectors, please check the Connector Docs.
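The overall shape of a DataStream pipeline can be sketched as follows. This is a minimal runnable example, not a connector demo: `fromElements` and `executeAndCollect` stand in for a real source and sink so the snippet is self-contained; a production job would use `env.fromSource(...)` and `sinkTo(...)` with a connector such as Kafka or files.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.ArrayList;
import java.util.List;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);

// Source -> transformation -> sink, the same pipeline shape a DataSet job has.
List<String> results = new ArrayList<>();
env.fromElements("flink", "datastream", "migration")
        .map(String::toUpperCase)
        .executeAndCollect()
        .forEachRemaining(results::add);
```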
Migrating DataSet APIs #
Category 1 #
For Category 1, these DataSet APIs have an exact equivalent in DataStream, and migration requires barely any changes.
Operations | DataSet | DataStream |
---|---|---|
Map | `dataSet.map(mapper)` | `dataStream.map(mapper)` |
FlatMap | `dataSet.flatMap(flatMapper)` | `dataStream.flatMap(flatMapper)` |
Filter | `dataSet.filter(predicate)` | `dataStream.filter(predicate)` |
Union | `dataSet1.union(dataSet2)` | `dataStream1.union(dataStream2)` |
Rebalance | `dataSet.rebalance()` | `dataStream.rebalance()` |
Project | `dataSet.project(0, 2)` | `dataStream.project(0, 2)` |
Reduce on Grouped DataSet | `dataSet.groupBy(keySelector).reduce(reducer)` | `dataStream.keyBy(keySelector).reduce(reducer)` |
Aggregate on Grouped DataSet | `dataSet.groupBy(keySelector).aggregate(SUM, 1)` | `dataStream.keyBy(keySelector).sum(1)` |
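As a concrete sketch of a Category 1 migration (assuming Flink is on the classpath), a grouped reduce migrates by swapping `groupBy` for `keyBy`; the reduce function itself is unchanged:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.ArrayList;
import java.util.List;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);

// DataSet version:    dataSet.groupBy(t -> t.f0).reduce(reducer)
// DataStream version: only groupBy -> keyBy changes.
List<Tuple2<String, Integer>> results = new ArrayList<>();
env.fromElements(Tuple2.of("x", 1), Tuple2.of("x", 2), Tuple2.of("y", 5))
        .keyBy(t -> t.f0)
        .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))
        .executeAndCollect()
        .forEachRemaining(results::add);
```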
Category 2 #
For Category 2, the behavior of these DataSet APIs can be achieved by other APIs with different semantics in DataStream; migration might require some code changes but results in the same execution efficiency.
Operations on a full DataSet correspond to a global window aggregation in DataStream, with a custom window that is triggered at the end of the inputs. The `EndOfStreamWindows` class in the Appendix shows how such a window can be implemented; we will reuse it in the rest of this document.
Operations | DataSet | DataStream |
---|---|---|
Distinct | `dataSet.distinct()` | `dataStream.keyBy(value -> value).reduce((v1, v2) -> v1)` |
Hash-Partition | `dataSet.partitionByHash(keySelector)` | `dataStream.keyBy(keySelector)` |
Reduce on Full DataSet | `dataSet.reduce(reducer)` | `dataStream.windowAll(EndOfStreamWindows.get()).reduce(reducer)` |
Aggregate on Full DataSet | `dataSet.aggregate(SUM, 1)` | `dataStream.windowAll(EndOfStreamWindows.get()).sum(1)` |
GroupReduce on Full DataSet | `dataSet.reduceGroup(groupReducer)` | `dataStream.windowAll(EndOfStreamWindows.get()).apply(allWindowFunction)` |
GroupReduce on Grouped DataSet | `dataSet.groupBy(keySelector).reduceGroup(groupReducer)` | `dataStream.keyBy(keySelector).window(EndOfStreamWindows.get()).apply(windowFunction)` |
First-n | `dataSet.first(n)` | `dataStream.windowAll(EndOfStreamWindows.get()).apply(/* emit only the first n records */)` |
Join | `dataSet1.join(dataSet2).where(keySelector1).equalTo(keySelector2).with(joinFunction)` | `dataStream1.join(dataStream2).where(keySelector1).equalTo(keySelector2).window(EndOfStreamWindows.get()).apply(joinFunction)` |
CoGroup | `dataSet1.coGroup(dataSet2).where(keySelector1).equalTo(keySelector2).with(coGroupFunction)` | `dataStream1.coGroup(dataStream2).where(keySelector1).equalTo(keySelector2).window(EndOfStreamWindows.get()).apply(coGroupFunction)` |
OuterJoin | `dataSet1.leftOuterJoin(dataSet2).where(keySelector1).equalTo(keySelector2).with(joinFunction)` | `dataStream1.coGroup(dataStream2).where(keySelector1).equalTo(keySelector2).window(EndOfStreamWindows.get()).apply(/* also emit records without a match */)` |
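As one self-contained sketch of a Category 2 migration (assuming Flink is on the classpath): `distinct()` can be emulated by keying the stream on the value itself and keeping the first record per key. In BATCH mode only the final record per key is emitted, so the output contains each value exactly once.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.ArrayList;
import java.util.List;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);

// dataSet.distinct() -> key by the value, keep the first record per key.
List<Integer> results = new ArrayList<>();
env.fromElements(1, 1, 2, 3, 3, 3)
        .keyBy(v -> v)
        .reduce((v1, v2) -> v1)
        .executeAndCollect()
        .forEachRemaining(results::add);
```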
Category 3 #
For Category 3, the behavior of these DataSet APIs can be achieved by other APIs with different semantics in DataStream, at a potentially additional cost in execution efficiency.
Currently, the DataStream API does not directly support aggregations on non-keyed streams (subtask-scope aggregations). To achieve them, we need to first assign the subtask id to the records and then turn the stream into a keyed stream. The `AddSubtaskIDMapFunction` in the Appendix shows how to do that, and we will reuse it in the rest of this document.
Operations | DataSet | DataStream |
---|---|---|
MapPartition/SortPartition | `dataSet.mapPartition(partitionMapper)` | `dataStream.map(new AddSubtaskIDMapFunction<>()).keyBy(t -> t.f0).window(EndOfStreamWindows.get()).apply(/* process all records of one subtask */)` |
Cross | `dataSet1.cross(dataSet2)` | `dataStream1.broadcast().connect(dataStream2).flatMap(/* buffer the broadcast side and emit the cross product */)` |
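The subtask-id trick can be sketched end to end as follows (assuming Flink is on the classpath). The appendix's `AddSubtaskIDMapFunction` is inlined here as an anonymous class so the snippet is self-contained, and a keyed `sum` stands in for the per-partition aggregation you would actually run:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.ArrayList;
import java.util.List;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
env.setParallelism(2);

// Tag each record with the id of the subtask that processed it, then key by
// that id so all records of one subtask land in the same keyed group.
List<Tuple2<String, Integer>> results = new ArrayList<>();
env.fromElements(10, 20, 30, 40)
        .map(new RichMapFunction<Integer, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(Integer value) {
                return Tuple2.of(
                        String.valueOf(getRuntimeContext().getIndexOfThisSubtask()), value);
            }
        })
        .keyBy(t -> t.f0)
        .sum(1) // stands in for an arbitrary per-partition aggregation
        .executeAndCollect()
        .forEachRemaining(results::add);
```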
Category 4 #
The behaviors of the following DataSet APIs are not supported by DataStream.
- RangePartition
- GroupCombine
Appendix #
EndOfStreamWindows #
The following code shows an example implementation of `EndOfStreamWindows`.
public class EndOfStreamWindows extends WindowAssigner<Object, TimeWindow> {

    private static final long serialVersionUID = 1L;

    private static final EndOfStreamWindows INSTANCE = new EndOfStreamWindows();

    private static final TimeWindow TIME_WINDOW_INSTANCE =
            new TimeWindow(Long.MIN_VALUE, Long.MAX_VALUE);

    private EndOfStreamWindows() {}

    public static EndOfStreamWindows get() {
        return INSTANCE;
    }

    @Override
    public Collection<TimeWindow> assignWindows(
            Object element, long timestamp, WindowAssignerContext context) {
        return Collections.singletonList(TIME_WINDOW_INSTANCE);
    }

    @Override
    public Trigger<Object, TimeWindow> getDefaultTrigger(StreamExecutionEnvironment env) {
        return new EndOfStreamTrigger();
    }

    @Override
    public String toString() {
        return "EndOfStreamWindows()";
    }

    @Override
    public TypeSerializer<TimeWindow> getWindowSerializer(ExecutionConfig executionConfig) {
        return new TimeWindow.Serializer();
    }

    @Override
    public boolean isEventTime() {
        return true;
    }

    @Internal
    public static class EndOfStreamTrigger extends Trigger<Object, TimeWindow> {
        @Override
        public TriggerResult onElement(
                Object element, long timestamp, TimeWindow window, TriggerContext ctx)
                throws Exception {
            return TriggerResult.CONTINUE;
        }

        @Override
        public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
            return time == window.maxTimestamp() ? TriggerResult.FIRE : TriggerResult.CONTINUE;
        }

        @Override
        public void clear(TimeWindow window, TriggerContext ctx) throws Exception {}

        @Override
        public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
            return TriggerResult.CONTINUE;
        }
    }
}
AddSubtaskIDMapFunction #
The following code shows an example implementation of `AddSubtaskIDMapFunction`.
public static class AddSubtaskIDMapFunction<T> extends RichMapFunction<T, Tuple2<String, T>> {
    @Override
    public Tuple2<String, T> map(T value) {
        return Tuple2.of(String.valueOf(getRuntimeContext().getIndexOfThisSubtask()), value);
    }
}