public interface OperatorCoordinator extends CheckpointListener, AutoCloseable
Operator coordinators are, for example, source and sink coordinators that discover and assign work, or aggregate and commit metadata.
All coordinator methods are called by the JobManager's main thread (mailbox thread). That means that these methods must not, under any circumstances, perform blocking operations (like I/O or waiting on locks or futures). That would run a high risk of bringing down the entire JobManager.
Coordinators that involve more complex operations should hence spawn threads to handle the I/O work. The methods on the OperatorCoordinator.Context are safe to call from a thread other than the one that calls the Coordinator's methods.
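The threading rule above can be illustrated with a minimal sketch. It uses only the JDK and does not implement the real interface; the class name OffloadingCoordinatorSketch and the methods handleEvent and slowLookup are hypothetical stand-ins for a coordinator callback and for blocking I/O.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: not a Flink class and not an OperatorCoordinator implementation. */
class OffloadingCoordinatorSketch implements AutoCloseable {

    // Worker thread dedicated to blocking work, so the calling thread never waits.
    private final ExecutorService ioExecutor = Executors.newSingleThreadExecutor();

    /** Stand-in for a coordinator callback: schedules the work and returns immediately. */
    CompletableFuture<String> handleEvent(String event) {
        return CompletableFuture.supplyAsync(() -> slowLookup(event), ioExecutor);
    }

    // Stand-in for blocking I/O (assumption: any slow external call).
    private static String slowLookup(String event) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "handled:" + event;
    }

    /** Stops accepting new work without waiting for running tasks. */
    @Override
    public void close() {
        ioExecutor.shutdown();
    }

    public static void main(String[] args) {
        try (OffloadingCoordinatorSketch c = new OffloadingCoordinatorSketch()) {
            // Only the demo waits on the future; a coordinator callback should not.
            System.out.println(c.handleEvent("split-request").join());
        }
    }
}
```

In a real coordinator, the worker thread would report its result back through the OperatorCoordinator.Context rather than through a returned future.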
Modifier and Type | Interface and Description |
---|---|
static interface | OperatorCoordinator.Context: The context gives the OperatorCoordinator access to contextual information and provides a gateway to interact with other components, such as sending operator events. |
static interface | OperatorCoordinator.Provider: The provider creates an OperatorCoordinator and takes an OperatorCoordinator.Context to pass to the OperatorCoordinator. |
Modifier and Type | Field and Description |
---|---|
static long | NO_CHECKPOINT: The checkpoint ID passed to the restore methods when no completed checkpoint exists, yet. |
Modifier and Type | Method and Description |
---|---|
void | checkpointCoordinator(long checkpointId, CompletableFuture<byte[]> resultFuture): Takes a checkpoint of the coordinator. |
void | close(): This method is called when the coordinator is disposed. |
void | handleEventFromOperator(int subtask, OperatorEvent event): Hands an OperatorEvent from a task (on the Task Manager) to this coordinator. |
default void | notifyCheckpointAborted(long checkpointId): We override the method here to remove the checked exception. |
void | notifyCheckpointComplete(long checkpointId): We override the method here to remove the checked exception. |
void | resetToCheckpoint(long checkpointId, byte[] checkpointData): Resets the coordinator to the given checkpoint. |
void | start(): Starts the coordinator. |
void | subtaskFailed(int subtask, Throwable reason): Called when one of the subtasks of the task running the coordinated operator goes through a failover (failure / recovery cycle). |
void | subtaskReset(int subtask, long checkpointId): Called if a task is recovered as part of a partial failover, meaning a failover handled by the scheduler's failover strategy (by default recovering a pipelined region). |
static final long NO_CHECKPOINT

void start() throws Exception
Throws:
Exception - Any exception thrown from this method causes a full job failure.

void close() throws Exception
Specified by:
close in interface AutoCloseable
Throws:
Exception

void handleEventFromOperator(int subtask, OperatorEvent event) throws Exception
Throws:
Exception - Any exception thrown by this method results in a full job failure and recovery.

void checkpointCoordinator(long checkpointId, CompletableFuture<byte[]> resultFuture) throws Exception
To confirm the checkpoint and store state in it, the given CompletableFuture must be completed with the state. To abort or dis-confirm the checkpoint, the given CompletableFuture must be completed exceptionally. In any case, the given CompletableFuture must be completed in some way, otherwise the checkpoint will not progress.
The semantics are defined as follows:
Throws:
Exception
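The completion contract for checkpointCoordinator can be sketched as follows. This is a JDK-only illustration, not a Flink class: CheckpointFutureSketch, its String state, and the update and serialize helpers are hypothetical, and only the method shape is taken from the signature above.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch: mirrors the method shape without implementing the Flink interface. */
class CheckpointFutureSketch {

    // Assumption: the coordinator state is a single string.
    private String state = "";

    void update(String newState) {
        this.state = newState;
    }

    /** Completes the future on every path, so the checkpoint can always progress. */
    void checkpointCoordinator(long checkpointId, CompletableFuture<byte[]> resultFuture) {
        try {
            // Confirm the checkpoint by completing the future with the serialized state.
            resultFuture.complete(serialize());
        } catch (Throwable t) {
            // Abort the checkpoint instead of leaving the future incomplete.
            resultFuture.completeExceptionally(t);
        }
    }

    private byte[] serialize() {
        return state.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        CheckpointFutureSketch coordinator = new CheckpointFutureSketch();
        coordinator.update("offset=42");
        CompletableFuture<byte[]> result = new CompletableFuture<>();
        coordinator.checkpointCoordinator(1L, result);
        System.out.println(new String(result.join(), StandardCharsets.UTF_8));
    }
}
```

If serialization were slow, the future could instead be completed later from a worker thread, as long as it is eventually completed one way or the other.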

void notifyCheckpointComplete(long checkpointId)
We override the method here to remove the checked exception. See CheckpointListener.notifyCheckpointComplete(long) for the detailed semantics of the method.
Specified by:
notifyCheckpointComplete in interface CheckpointListener
Parameters:
checkpointId - The ID of the checkpoint that has been completed.

default void notifyCheckpointAborted(long checkpointId)
We override the method here to remove the checked exception. See CheckpointListener.notifyCheckpointAborted(long) for the detailed semantics of the method.
Specified by:
notifyCheckpointAborted in interface CheckpointListener
Parameters:
checkpointId - The ID of the checkpoint that has been aborted.

void resetToCheckpoint(long checkpointId, @Nullable byte[] checkpointData) throws Exception
This method is called in the case of a global failover of the system, which means a failover of the coordinator (JobManager). This method is not invoked on a partial failover; partial failovers call the subtaskReset(int, long) method for the involved subtasks.
This method is expected to behave synchronously with respect to other method calls and calls to Context methods. For example, events sent by the Coordinator after this method returns are assumed to take place after the checkpoint that was restored.
This method is called with a null state argument in the following situations:
Restoring to a checkpoint is a way of confirming that the checkpoint is complete. It is safe to commit side-effects that are predicated on checkpoint completion after this call.
Even if no call to notifyCheckpointComplete(long) happened, the checkpoint can still be complete (for example when a system failure happened directly after committing the checkpoint, before calling the notifyCheckpointComplete(long) method).
Throws:
Exception
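A minimal sketch of the restore behavior described above: tolerate a null state argument, and treat the restore itself as confirmation that the checkpoint completed. Everything here is hypothetical and JDK-only. RestoreSketch is not a Flink class, the commits list merely records side-effects for illustration, and the sentinel value -1 is an assumption of this sketch rather than a documented value of NO_CHECKPOINT.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: mirrors two method shapes without implementing the Flink interface. */
class RestoreSketch {

    // Assumption for this sketch only: a sentinel meaning "no completed checkpoint".
    static final long NO_CHECKPOINT_SENTINEL = -1L;

    private String restoredState = "";
    private long lastCommitted = NO_CHECKPOINT_SENTINEL;

    // Records each committed checkpoint ID, standing in for a real side-effect.
    final List<Long> commits = new ArrayList<>();

    void notifyCheckpointComplete(long checkpointId) {
        commitUpTo(checkpointId);
    }

    void resetToCheckpoint(long checkpointId, byte[] checkpointData) {
        // A null payload means there is no coordinator state to restore: start empty.
        restoredState = checkpointData == null
                ? ""
                : new String(checkpointData, StandardCharsets.UTF_8);
        // Restoring confirms completion, so commit even if the notification never arrived.
        commitUpTo(checkpointId);
    }

    // Idempotent: committing the same or an older checkpoint again does nothing.
    private void commitUpTo(long checkpointId) {
        if (checkpointId > lastCommitted) {
            commits.add(checkpointId);
            lastCommitted = checkpointId;
        }
    }

    String state() {
        return restoredState;
    }

    public static void main(String[] args) {
        RestoreSketch sketch = new RestoreSketch();
        sketch.resetToCheckpoint(7L, "abc".getBytes(StandardCharsets.UTF_8));
        sketch.notifyCheckpointComplete(7L); // late notification, already committed
        System.out.println(sketch.state() + " " + sketch.commits);
    }
}
```

Making the commit idempotent is the design choice that matters here, since both the notification and the restore may refer to the same checkpoint.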

void subtaskFailed(int subtask, @Nullable Throwable reason)
This method is called every time there is a failover of a subtask, regardless of whether it is a partial failover or a global failover.

void subtaskReset(int subtask, long checkpointId)
In contrast to this method, the resetToCheckpoint(long, byte[]) method is called in the case of a global failover, which is the case when the coordinator (JobManager) is recovered.
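One way a work-assigning coordinator might react to a partial failover is sketched below: work handed to a subtask after the checkpoint it is reset to goes back into the pending pool. This is an illustration under assumptions, not Flink's implementation. PartialFailoverSketch, addSplit, checkpointTaken, assignNext, and the String "split" type are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch: not a Flink class; shows one possible reaction to subtaskReset. */
class PartialFailoverSketch {

    /** One assignment, tagged with the latest checkpoint taken before it was made. */
    private static final class Assignment {
        final int subtask;
        final long afterCheckpoint;
        final String split;

        Assignment(int subtask, long afterCheckpoint, String split) {
            this.subtask = subtask;
            this.afterCheckpoint = afterCheckpoint;
            this.split = split;
        }
    }

    private final Deque<String> pending = new ArrayDeque<>();
    private final List<Assignment> assigned = new ArrayList<>();
    private long latestCheckpointId = -1L; // assumption: -1 means "none taken yet"
    private int failureCount = 0;

    void addSplit(String split) {
        pending.add(split);
    }

    void checkpointTaken(long checkpointId) {
        latestCheckpointId = checkpointId;
    }

    /** Hands the next pending split to the subtask, or returns null if none is left. */
    String assignNext(int subtask) {
        String split = pending.poll();
        if (split != null) {
            assigned.add(new Assignment(subtask, latestCheckpointId, split));
        }
        return split;
    }

    void subtaskFailed(int subtask, Throwable reason) {
        failureCount++; // called on partial and global failovers alike
    }

    /** Returns work assigned to the subtask after the restored checkpoint to the pool. */
    void subtaskReset(int subtask, long checkpointId) {
        Iterator<Assignment> it = assigned.iterator();
        while (it.hasNext()) {
            Assignment a = it.next();
            if (a.subtask == subtask && a.afterCheckpoint >= checkpointId) {
                pending.add(a.split);
                it.remove();
            }
        }
    }

    int pendingCount() {
        return pending.size();
    }

    int failures() {
        return failureCount;
    }
}
```

For example, if split "b" is assigned to subtask 0 after checkpoint 1 and subtask 0 is then reset to checkpoint 1, "b" becomes assignable again, while work assigned before checkpoint 1 stays with the subtask.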
Copyright © 2014–2021 The Apache Software Foundation. All rights reserved.