OperatorCoordinatorHolder (flink 1.11-SNAPSHOT API)

java.lang.Object
- org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder

All Implemented Interfaces:

AutoCloseable, OperatorCoordinatorCheckpointContext, OperatorCoordinator, OperatorInfo, CheckpointListener
```
public class OperatorCoordinatorHolder
extends Object
implements OperatorCoordinator, OperatorCoordinatorCheckpointContext
```
The OperatorCoordinatorHolder holds the OperatorCoordinator and manages all its interactions with the remaining components. It provides the context and is responsible for checkpointing and exactly-once semantics.
Exactly-once Semantics

The semantics are described under OperatorCoordinator.checkpointCoordinator(long, CompletableFuture).
Exactly-once Mechanism

This implementation can handle one checkpoint being triggered at a time. If another checkpoint is triggered while the triggering of the first one was not completed or aborted, this class will throw an exception. That is in line with the capabilities of the Checkpoint Coordinator, which can handle multiple concurrent checkpoints on the TaskManagers, but only one concurrent triggering phase.
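The one-triggering-at-a-time rule can be pictured with a small guard. This is only a sketch under stated assumptions: the class `SingleTriggerGuard`, its methods, the `-1` sentinel, and the choice of `IllegalStateException` are hypothetical and are not taken from the Flink sources.

```java
/** Hypothetical guard that allows only one checkpoint triggering at a time. */
final class SingleTriggerGuard {
    private static final long NONE = -1L; // assumed sentinel, not a Flink constant
    private long triggering = NONE;

    /** Marks the start of a triggering phase; fails if one is already in progress. */
    synchronized void begin(long checkpointId) {
        if (triggering != NONE) {
            throw new IllegalStateException(
                    "Checkpoint " + triggering + " is still being triggered");
        }
        triggering = checkpointId;
    }

    /** Marks the current triggering phase as completed or aborted. */
    synchronized void end() {
        triggering = NONE;
    }
}
```

A second `begin` call is rejected until `end` has been called for the first one.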
The mechanism for exactly-once semantics is as follows:
- Events pass through a special channel, the OperatorEventValve. If we are not currently triggering a checkpoint, then events simply pass through.
- Atomically, with the completion of the checkpoint future for the coordinator, this operator's event valve is closed. Events coming after that are held back (buffered), because they belong to the epoch after the checkpoint.
- Once all coordinators in the job have completed the checkpoint, the barriers to the sources are injected. After that (see afterSourceBarrierInjection(long)) the valves are opened again and the events are sent.
- If a task fails in the meantime, the events are dropped from the valve. From the coordinator's perspective, these events are lost, because they were sent to a failed subtask after its latest complete checkpoint.
IMPORTANT: A critical assumption is that all events from the scheduler to the Tasks are transported strictly in order. Events being sent from the coordinator after the checkpoint barrier was injected must not overtake the checkpoint barrier. This is currently guaranteed by Flink's RPC mechanism.
Consider this example:
```
 Coordinator one events: => a . . b . |trigger| . . |complete| . . c . . d . |barrier| . e . f
 Coordinator two events: => . . x . . |trigger| . . . . . . . . . .|complete||barrier| . . y . . z
 
```
Two coordinators trigger checkpoints at the same time. 'Coordinator Two' takes longer to complete, and in the meantime 'Coordinator One' sends more events.
'Coordinator One' emits events 'c' and 'd' after it finished its checkpoint, meaning the events must take place after the checkpoint. But they are before the barrier injection, meaning the runtime task would see them before the checkpoint, if they were immediately transported.
'Coordinator One' closes its valve as soon as the checkpoint future completes. Events 'c' and 'd' get held back in the valve. Once 'Coordinator Two' completes its checkpoint, the barriers are sent to the sources. Then the valves are opened, and events 'c' and 'd' can flow to the tasks where they are received after the barrier.
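The walkthrough for 'Coordinator One' can be replayed with a toy valve. This is a minimal sketch, not Flink's OperatorEventValve: the class `ToyEventValve` and its methods `send`, `shut`, `open`, and `dropBuffered` are hypothetical, events are plain strings, and the channel to the task is modelled as a list.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for an event valve; not the Flink class. */
final class ToyEventValve {
    private final List<String> channel;                       // ordered channel to the task
    private final List<String> buffered = new ArrayList<>();  // events held back while shut
    private boolean closed;

    ToyEventValve(List<String> channel) {
        this.channel = channel;
    }

    /** Forwards the event if the valve is open, otherwise buffers it. */
    synchronized void send(String event) {
        if (closed) {
            buffered.add(event);
        } else {
            channel.add(event);
        }
    }

    /** Called at the moment the coordinator's checkpoint future completes. */
    synchronized void shut() {
        closed = true;
    }

    /** Called after barrier injection: releases buffered events in order. */
    synchronized void open() {
        channel.addAll(buffered);
        buffered.clear();
        closed = false;
    }

    /** Called when the target task failed: buffered events are lost. */
    synchronized void dropBuffered() {
        buffered.clear();
    }

    /** Replays the event sequence of 'Coordinator One' from the example. */
    static List<String> replayCoordinatorOne() {
        List<String> channel = new ArrayList<>();
        ToyEventValve valve = new ToyEventValve(channel);
        valve.send("a");
        valve.send("b");
        valve.shut();           // checkpoint future completes
        valve.send("c");        // held back
        valve.send("d");        // held back
        channel.add("barrier"); // barrier injected once all coordinators are done
        valve.open();           // c and d now follow the barrier
        valve.send("e");
        valve.send("f");
        return channel;
    }
}
```

In this sketch the task sees `a, b, barrier, c, d, e, f`, so the events sent after the checkpoint arrive after the barrier.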
Concurrency and Threading Model

This component runs mainly in a main-thread-executor, like RPC endpoints. However, some actions need to be triggered synchronously by other threads. Most notably, when the checkpoint future is completed by the OperatorCoordinator implementation, we need to synchronously suspend event-sending.
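The split between synchronous and main-thread work might look roughly like the following. This is a sketch under assumptions, not the holder's actual code: the class `ThreadingSketch`, the thread names, and the single-threaded `ExecutorService` standing in for the main-thread-executor are all hypothetical. The callback is registered before the future completes and only one thread completes it, so the callback runs in the completing thread.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReference;

final class ThreadingSketch {

    /** Returns {thread that suspended event sending, thread that ran the follow-up}. */
    static String[] run() {
        // Stand-in for the main-thread-executor (assumption: one dedicated thread).
        ExecutorService mainThread =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "main-thread"));
        AtomicReference<String> suspendedBy = new AtomicReference<>();
        CompletableFuture<String> followUpBy = new CompletableFuture<>();

        CompletableFuture<byte[]> checkpointFuture = new CompletableFuture<>();
        checkpointFuture.whenComplete((state, failure) -> {
            // Synchronous part: suspend event sending in the completing thread.
            suspendedBy.set(Thread.currentThread().getName());
            // All remaining bookkeeping hops back to the main thread.
            mainThread.execute(
                    () -> followUpBy.complete(Thread.currentThread().getName()));
        });

        // The coordinator implementation completes the future from its own thread.
        Thread coordinator =
                new Thread(() -> checkpointFuture.complete(new byte[0]), "coordinator-thread");
        coordinator.start();
        try {
            coordinator.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        String[] threads = {suspendedBy.get(), followUpBy.join()};
        mainThread.shutdown();
        return threads;
    }
}
```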

Nested Class Summary
- Nested classes/interfaces inherited from interface org.apache.flink.runtime.operators.coordination.OperatorCoordinator
  OperatorCoordinator.Context, OperatorCoordinator.Provider

Field Summary
- Fields inherited from interface org.apache.flink.runtime.operators.coordination.OperatorCoordinator
  NO_CHECKPOINT

Method Summary

Modifier and Type	Method and Description
`void`	`abortCurrentTriggering()`
`void`	`afterSourceBarrierInjection(long checkpointId)`
`void`	`checkpointCoordinator(long checkpointId, CompletableFuture<byte[]> result)` Takes a checkpoint of the coordinator.
`void`	`close()` This method is called when the coordinator is disposed.
`OperatorCoordinator`	`coordinator()`
`static OperatorCoordinatorHolder`	`create(SerializedValue<OperatorCoordinator.Provider> serializedProvider, ExecutionJobVertex jobVertex, ClassLoader classLoader)`
`int`	`currentParallelism()`
`void`	`handleEventFromOperator(int subtask, OperatorEvent event)` Hands an OperatorEvent from a task (on the Task Manager) to this coordinator.
`void`	`lazyInitialize(SchedulerNG scheduler, ComponentMainThreadExecutor mainThreadExecutor)`
`int`	`maxParallelism()`
`void`	`notifyCheckpointAborted(long checkpointId)` We override the method here to remove the checked exception.
`void`	`notifyCheckpointComplete(long checkpointId)` We override the method here to remove the checked exception.
`OperatorID`	`operatorId()`
`void`	`resetToCheckpoint(long checkpointId, byte[] checkpointData)` Resets the coordinator to the given checkpoint.
`void`	`start()` Starts the coordinator.
`void`	`subtaskFailed(int subtask, Throwable reason)` Called when one of the subtasks of the task running the coordinated operator goes through a failover (failure / recovery cycle).
`void`	`subtaskReset(int subtask, long checkpointId)` Called if a task is recovered as part of a partial failover, meaning a failover handled by the scheduler's failover strategy (by default recovering a pipelined region).

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface org.apache.flink.runtime.operators.coordination.OperatorInfo
getIds

- Method Detail
  - lazyInitialize
```
public void lazyInitialize(SchedulerNG scheduler,
                           ComponentMainThreadExecutor mainThreadExecutor)
```
  - coordinator
```
public OperatorCoordinator coordinator()
```
  - operatorId
```
public OperatorID operatorId()
```
    Specified by:
    
    operatorId in interface OperatorInfo
  - maxParallelism
```
public int maxParallelism()
```
    Specified by:
    
    maxParallelism in interface OperatorInfo
  - currentParallelism
```
public int currentParallelism()
```
    Specified by:
    
    currentParallelism in interface OperatorInfo
  - start
```
public void start()
           throws Exception
```
    Description copied from interface: OperatorCoordinator
    
    Starts the coordinator. This method is called once at the beginning, before any other methods.
    
    Specified by:
    
    start in interface OperatorCoordinator
    
    Throws:
    
    Exception - Any exception thrown from this method causes a full job failure.
  - close
```
public void close()
           throws Exception
```
    Description copied from interface: OperatorCoordinator
    
    This method is called when the coordinator is disposed. This method should release currently held resources. Exceptions in this method do not cause the job to fail.
    
    Specified by:
    
    close in interface AutoCloseable
    
    Specified by:
    
    close in interface OperatorCoordinator
    
    Throws:
    
    Exception
  - handleEventFromOperator
```
public void handleEventFromOperator(int subtask,
                                    OperatorEvent event)
                             throws Exception
```
    Description copied from interface: OperatorCoordinator
    
    Hands an OperatorEvent from a task (on the Task Manager) to this coordinator.
    
    Specified by:
    
    handleEventFromOperator in interface OperatorCoordinator
    
    Throws:
    
    Exception - Any exception thrown by this method results in a full job failure and recovery.
  - subtaskFailed
```
public void subtaskFailed(int subtask,
                          @Nullable
                          Throwable reason)
```
    Description copied from interface: OperatorCoordinator
    
    Called when one of the subtasks of the task running the coordinated operator goes through a failover (failure / recovery cycle).
    This method is called every time there is a failover of a subtask, regardless of whether it is a partial failover or a global failover.
    
    Specified by:
    
    subtaskFailed in interface OperatorCoordinator
  - subtaskReset
```
public void subtaskReset(int subtask,
                         long checkpointId)
```
    Description copied from interface: OperatorCoordinator
    
    Called if a task is recovered as part of a partial failover, meaning a failover handled by the scheduler's failover strategy (by default recovering a pipelined region). The method is invoked for each subtask involved in that partial failover.
    In contrast to this method, the OperatorCoordinator.resetToCheckpoint(long, byte[]) method is called in the case of a global failover, which is the case when the coordinator (JobManager) is recovered.
    
    Specified by:
    
    subtaskReset in interface OperatorCoordinatorCheckpointContext
    
    Specified by:
    
    subtaskReset in interface OperatorCoordinator
  - checkpointCoordinator
```
public void checkpointCoordinator(long checkpointId,
                                  CompletableFuture<byte[]> result)
```
    Description copied from interface: OperatorCoordinator
    Takes a checkpoint of the coordinator. The checkpoint is identified by the given ID.
    To confirm the checkpoint and store state in it, the given CompletableFuture must be completed with the state. To abort or dis-confirm the checkpoint, the given CompletableFuture must be completed exceptionally. In any case, the given CompletableFuture must be completed in some way, otherwise the checkpoint will not progress.
    Exactly-once Semantics
    
    The semantics are defined as follows:
    - The point in time when the checkpoint future is completed is considered the point in time when the coordinator's checkpoint takes place.
    - The OperatorCoordinator implementation must have a way of strictly ordering the sending of events and the completion of the checkpoint future (for example the same thread does both actions, or both actions are guarded by a mutex).
    - Every event sent before the checkpoint future is completed is considered before the checkpoint.
    - Every event sent after the checkpoint future is completed is considered to be after the checkpoint.
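    One way to meet the ordering requirement is to guard both actions with the same lock. The sketch below is an illustration under assumptions: `OrderedCoordinatorSketch` does not implement the Flink interface, `sendEvent` and `log` are hypothetical helpers, and the checkpointed state (the number of events sent so far) is made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical coordinator-like class; it does not implement the Flink interface. */
final class OrderedCoordinatorSketch {
    private final Object lock = new Object();
    private final List<String> log = new ArrayList<>();
    private int eventsSent;

    /** Sends an event; the lock orders it against checkpoint completion. */
    void sendEvent(String event) {
        synchronized (lock) {
            eventsSent++;
            log.add("event:" + event);
        }
    }

    /** Mirrors the shape of checkpointCoordinator(long, CompletableFuture). */
    void checkpointCoordinator(long checkpointId, CompletableFuture<byte[]> result) {
        synchronized (lock) {
            try {
                byte[] state =
                        Integer.toString(eventsSent).getBytes(StandardCharsets.UTF_8);
                log.add("checkpoint:" + checkpointId);
                // Completing the future is the moment the checkpoint takes place.
                result.complete(state);
            } catch (RuntimeException e) {
                // The future must always be completed, here exceptionally.
                result.completeExceptionally(e);
            }
        }
    }

    /** Returns a copy of the recorded order of events and checkpoints. */
    List<String> log() {
        synchronized (lock) {
            return new ArrayList<>(log);
        }
    }
}
```

    Because both methods take the same lock, every event is recorded either entirely before or entirely after the completion of the future, never interleaved with it.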
    Specified by:
    
    checkpointCoordinator in interface OperatorCoordinatorCheckpointContext
    
    Specified by:
    
    checkpointCoordinator in interface OperatorCoordinator
  - notifyCheckpointComplete
```
public void notifyCheckpointComplete(long checkpointId)
```
    Description copied from interface: OperatorCoordinator
    
    We override the method here to remove the checked exception. Please check the Javadoc of CheckpointListener.notifyCheckpointComplete(long) for the detailed semantics of the method.
    
    Specified by:
    
    notifyCheckpointComplete in interface OperatorCoordinatorCheckpointContext
    
    Specified by:
    
    notifyCheckpointComplete in interface OperatorCoordinator
    
    Specified by:
    
    notifyCheckpointComplete in interface CheckpointListener
    
    Parameters:
    
    checkpointId - The ID of the checkpoint that has been completed.
  - notifyCheckpointAborted
```
public void notifyCheckpointAborted(long checkpointId)
```
    Description copied from interface: OperatorCoordinator
    
    We override the method here to remove the checked exception. Please check the Javadoc of CheckpointListener.notifyCheckpointAborted(long) for the detailed semantics of the method.
    
    Specified by:
    
    notifyCheckpointAborted in interface OperatorCoordinatorCheckpointContext
    
    Specified by:
    
    notifyCheckpointAborted in interface OperatorCoordinator
    
    Specified by:
    
    notifyCheckpointAborted in interface CheckpointListener
    
    Parameters:
    
    checkpointId - The ID of the checkpoint that has been aborted.
  - resetToCheckpoint
```
public void resetToCheckpoint(long checkpointId,
                              @Nullable
                              byte[] checkpointData)
                       throws Exception
```
    Description copied from interface: OperatorCoordinator
    Resets the coordinator to the given checkpoint. When this method is called, the coordinator can discard all other in-flight working state. All subtasks will also have been reset to the same checkpoint.
    This method is called in the case of a global failover of the system, which means a failover of the coordinator (JobManager). This method is not invoked on a partial failover; partial failovers call the OperatorCoordinator.subtaskReset(int, long) method for the involved subtasks.
    This method is expected to behave synchronously with respect to other method calls and calls to Context methods. For example, events sent by the coordinator after this method returns are assumed to take place after the checkpoint that was restored.
    This method is called with a null state argument in the following situations:
    - There is a recovery and there was no completed checkpoint yet.
    - There is a recovery from a completed checkpoint/savepoint but it contained no state for the coordinator.
    In both cases, the coordinator should reset to an empty (new) state.
    Restoring implicitly notifies of Checkpoint Completion
    
    Restoring to a checkpoint is a way of confirming that the checkpoint is complete. It is safe to commit side-effects that are predicated on checkpoint completion after this call.
    Even if no call to OperatorCoordinator.notifyCheckpointComplete(long) happened, the checkpoint can still be complete (for example when a system failure happened directly after committing the checkpoint, before calling the OperatorCoordinator.notifyCheckpointComplete(long) method).
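    Handling of the nullable state argument could look like the sketch below. It is an illustration only: `ResettableCounterSketch` is a hypothetical class that does not implement the Flink interface, and its state (a single counter serialized as eight bytes) is an assumption made for the example.

```java
import java.nio.ByteBuffer;

/** Hypothetical coordinator-like class with a single long as its state. */
final class ResettableCounterSketch {
    private long count;

    void increment() {
        count++;
    }

    long count() {
        return count;
    }

    /** Serializes the current state, as a checkpoint would store it. */
    byte[] snapshot() {
        return ByteBuffer.allocate(Long.BYTES).putLong(count).array();
    }

    /** Mirrors the shape of resetToCheckpoint(long, byte[]). */
    void resetToCheckpoint(long checkpointId, byte[] checkpointData) {
        if (checkpointData == null) {
            // No completed checkpoint yet, or no state for this coordinator:
            // start from an empty (new) state.
            count = 0L;
        } else {
            // Discard in-flight state and restore what the checkpoint holds.
            count = ByteBuffer.wrap(checkpointData).getLong();
        }
    }
}
```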
    Specified by:
    
    resetToCheckpoint in interface OperatorCoordinatorCheckpointContext
    
    Specified by:
    
    resetToCheckpoint in interface OperatorCoordinator
    
    Throws:
    
    Exception
  - afterSourceBarrierInjection
```
public void afterSourceBarrierInjection(long checkpointId)
```
    Specified by:
    
    afterSourceBarrierInjection in interface OperatorCoordinatorCheckpointContext
  - abortCurrentTriggering
```
public void abortCurrentTriggering()
```
    Specified by:
    
    abortCurrentTriggering in interface OperatorCoordinatorCheckpointContext
  - create
```
public static OperatorCoordinatorHolder create(SerializedValue<OperatorCoordinator.Provider> serializedProvider,
                                               ExecutionJobVertex jobVertex,
                                               ClassLoader classLoader)
                                        throws Exception
```
    Throws:
    
    Exception
