State Bootstrapping

Applications often require some initial state derived from historical data in a file, database, or other system. Because state is managed by Apache Flink’s snapshotting mechanism, for Stateful Functions applications this means writing the initial state into a savepoint that can be used to start the job. Users can bootstrap initial state for Stateful Functions applications using Flink’s State Processor API and a StatefulFunctionsSavepointCreator.

Attention: The savepoint creator currently only supports initializing the state for embedded Java functions.

To get started, include the following libraries in your application:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>statefun-flink-state-processor</artifactId>
    <version>3.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-state-processor-api_2.12</artifactId>
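    <!-- Flink version, not the Stateful Functions version; StateFun 3.2.0 builds against Flink 1.14.3 -->
    <version>1.14.3</version>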
</dependency>

State Bootstrap Function #

A StateBootstrapFunction defines how to bootstrap state for a StatefulFunction instance with a given input.

Each bootstrap function instance corresponds directly to a StatefulFunction type. Likewise, each instance is uniquely identified by an address, represented by the type and id of the function being bootstrapped. Any state persisted by a bootstrap function instance will be available to the live StatefulFunction instance that has the same address.

For example, consider the following state bootstrap function:

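import org.apache.flink.statefun.flink.state.processor.Context;
import org.apache.flink.statefun.flink.state.processor.StateBootstrapFunction;
import org.apache.flink.statefun.sdk.annotations.Persisted;
import org.apache.flink.statefun.sdk.state.PersistedValue;

public class MyStateBootstrapFunction implements StateBootstrapFunction {

	// Persisted under the name "my-state"; the live function reads it under the same name.
	@Persisted
	private final PersistedValue<MyState> state = PersistedValue.of("my-state", MyState.class);

	@Override
	public void bootstrap(Context context, Object input) {
		// extractStateFromInput is an application-specific helper (not shown)
		// that converts a bootstrap record into a MyState value.
		state.set(extractStateFromInput(input));
	}
}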

Assume that this bootstrap function is provided for function type MyFunctionType, and that the id of the bootstrap function instance is id-13. The function writes persisted state named my-state using the given bootstrap data. After a Stateful Functions application is restored from the savepoint generated with this bootstrap function, the stateful function instance with address (MyFunctionType, id-13) will already have a value available under the state name my-state.
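The address-based hand-off described here can be pictured with a small plain-Java model. This is only a conceptual sketch: ModelAddress and BootstrapModel are hypothetical names, not Stateful Functions classes, and the in-memory map merely stands in for the state that a real savepoint would carry.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

/** Hypothetical stand-in for a function address: function type plus instance id. */
final class ModelAddress {
    final String functionType;
    final String id;

    ModelAddress(String functionType, String id) {
        this.functionType = functionType;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ModelAddress)) {
            return false;
        }
        ModelAddress other = (ModelAddress) o;
        return functionType.equals(other.functionType) && id.equals(other.id);
    }

    @Override
    public int hashCode() {
        return Objects.hash(functionType, id);
    }
}

/** Toy model of a savepoint: state values keyed first by address, then by state name. */
final class BootstrapModel {
    private final Map<ModelAddress, Map<String, Object>> savepoint = new HashMap<>();

    /** Models a bootstrap function instance persisting a value under a state name. */
    void bootstrap(ModelAddress address, String stateName, Object value) {
        savepoint.computeIfAbsent(address, a -> new HashMap<>()).put(stateName, value);
    }

    /** Models the live function after restore: it sees only state written for its own address. */
    Optional<Object> read(ModelAddress address, String stateName) {
        Map<String, Object> states = savepoint.get(address);
        return states == null ? Optional.empty() : Optional.ofNullable(states.get(stateName));
    }
}
```

In this model, a value bootstrapped for (MyFunctionType, id-13) under my-state is readable only through that same address and state name; any other id or state name yields an empty result.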

Creating A Savepoint #

Savepoints are created by defining certain metadata, such as max parallelism and state backend. The default state backend is RocksDB.

int maxParallelism = 128;
StatefulFunctionsSavepointCreator newSavepoint = new StatefulFunctionsSavepointCreator(maxParallelism);

Each input data set is registered in the savepoint creator with a router that routes each record to zero or more function instances. You may then register any number of function types to the savepoint creator, similar to how functions are registered within a Stateful Functions module. Finally, specify an output location for the resulting savepoint.

// Read data from a file, database, or other location
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

final DataSet<Tuple2<String, Integer>> userSeenCounts = env.fromElements(
	Tuple2.of("foo", 4), Tuple2.of("bar", 3), Tuple2.of("joe", 2));

// Register the dataset with a router
newSavepoint.withBootstrapData(userSeenCounts, MyStateBootstrapFunctionRouter::new);

// Register a bootstrap function to process the records
newSavepoint.withStateBootstrapFunctionProvider(
		new FunctionType("apache", "my-function"),
		ignored -> new MyStateBootstrapFunction());

newSavepoint.write("file:///savepoint/path/");

env.execute();
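The example above refers to MyStateBootstrapFunctionRouter without showing it. The following is a minimal sketch of what such a router might look like, assuming each user name should become the id of the function instance being bootstrapped. So that the sketch compiles on its own, it declares simplified stand-ins for FunctionType, Address, Tuple2, Downstream, and Router; in a real project these would be imported from the Stateful Functions SDK and Flink, and the shapes shown here are approximations rather than the exact library signatures.

```java
// Simplified stand-ins so the sketch is self-contained. These approximate the
// SDK and Flink types of the same names; they are assumptions, not the real classes.
final class FunctionType {
    private final String namespace;
    private final String name;

    FunctionType(String namespace, String name) {
        this.namespace = namespace;
        this.name = name;
    }

    String namespace() { return namespace; }
    String name() { return name; }
}

final class Address {
    private final FunctionType type;
    private final String id;

    Address(FunctionType type, String id) {
        this.type = type;
        this.id = id;
    }

    FunctionType type() { return type; }
    String id() { return id; }
}

final class Tuple2<A, B> {
    final A f0;
    final B f1;

    private Tuple2(A f0, B f1) {
        this.f0 = f0;
        this.f1 = f1;
    }

    static <A, B> Tuple2<A, B> of(A f0, B f1) {
        return new Tuple2<>(f0, f1);
    }
}

interface Downstream<T> {
    void forward(Address to, T message);
}

interface Router<T> {
    void route(T message, Downstream<T> downstream);
}

// The router itself: each input record is forwarded to exactly one function instance.
final class MyStateBootstrapFunctionRouter implements Router<Tuple2<String, Integer>> {

    // Must match the function type registered with the savepoint creator.
    static final FunctionType MY_FUNCTION_TYPE = new FunctionType("apache", "my-function");

    @Override
    public void route(Tuple2<String, Integer> message, Downstream<Tuple2<String, Integer>> downstream) {
        // Use the user name (f0) as the function id, so the record ("foo", 4)
        // is delivered to the instance at address (apache/my-function, "foo").
        downstream.forward(new Address(MY_FUNCTION_TYPE, message.f0), message);
    }
}
```

With a router along these lines, the bootstrap function registered for that function type receives the forwarded tuple as its input argument. A router may also call forward several times, or not at all, for a given record, which is how a record reaches zero or more function instances.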

For full details of how to use Flink’s DataSet API, please check the official documentation.

Deployment #

After the new savepoint is created, it can be used to provide the initial state for a Stateful Functions application.

When deploying based on an image, pass the -s argument, followed by the savepoint path, to the Flink JobMaster image.

version: "2.1"
services:
  master:
    image: my-statefun-application-image
    command: -s file:///savepoint/path

When deploying to a Flink session cluster, specify the savepoint argument in the Flink CLI.

$ ./bin/flink run -s file:///savepoint/path stateful-functions-job.jar