public class Execution extends Object implements AccessExecution, Archiveable<ArchivedExecution>, LogicalSlot.Payload
ExecutionVertex
can be executed multiple times
(for recovery, re-computation, re-configuration), this class tracks the state of a single
execution of that vertex and the resources.
In several points of the code, we need to deal with possible concurrent state changes and actions. For example, while the call to deploy a task (send it to the TaskManager) happens, the task gets cancelled.
We could lock the entire portion of the code (decision to deploy, deploy, set state to running) such that it is guaranteed that any "cancel command" will only pick up after deployment is done and that the "cancel command" call will never overtake the deploying call.
This blocks the threads big time, because the remote calls may take long. Depending of their locking behavior, it may even result in distributed deadlocks (unless carefully avoided). We therefore use atomic state updates and occasional double-checking to ensure that the state after a completed call is as expected, and trigger correcting actions if it is not. Many actions are also idempotent (like canceling).
Constructor and Description |
---|
Execution(Executor executor,
ExecutionVertex vertex,
int attemptNumber,
long startTimestamp,
Duration rpcTimeout)
Creates a new Execution attempt.
|
Modifier and Type | Method and Description |
---|---|
ArchivedExecution |
archive() |
void |
cancel() |
static ResultPartitionDeploymentDescriptor |
createResultPartitionDeploymentDescriptor(IntermediateResultPartition partition,
ShuffleDescriptor shuffleDescriptor) |
void |
deploy()
Deploys the execution to the previously assigned resource.
|
void |
fail(Throwable t)
This method fails the vertex due to an external condition.
|
AllocationID |
getAssignedAllocationID() |
LogicalSlot |
getAssignedResource() |
TaskManagerLocation |
getAssignedResourceLocation()
Returns the
TaskManagerLocation for this execution. |
ExecutionAttemptID |
getAttemptId()
Returns the
ExecutionAttemptID for this Execution. |
int |
getAttemptNumber()
Returns the attempt number for this execution.
|
Optional<ErrorInfo> |
getFailureInfo()
Returns the exception that caused the job to fail.
|
CompletableFuture<?> |
getInitializingOrRunningFuture()
Gets a future that completes once the task execution reaches one of the states
ExecutionState.INITIALIZING or ExecutionState.RUNNING . |
IOMetrics |
getIOMetrics() |
Optional<InputSplit> |
getNextInputSplit() |
int |
getParallelSubtaskIndex()
Returns the subtask index of this execution.
|
CompletableFuture<?> |
getReleaseFuture()
Gets the release future which is completed once the execution reaches a terminal state and
the assigned resource has been released.
|
Optional<ResultPartitionDeploymentDescriptor> |
getResultPartitionDeploymentDescriptor(IntermediateResultPartitionID id) |
ExecutionState |
getState()
Returns the current
ExecutionState for this execution. |
long |
getStateEndTimestamp(ExecutionState state)
Returns the end timestamp for the given
ExecutionState . |
long[] |
getStateEndTimestamps()
Returns the end timestamps for every
ExecutionState . |
long |
getStateTimestamp(ExecutionState state)
Returns the timestamp for the given
ExecutionState . |
long[] |
getStateTimestamps()
Returns the timestamps for every
ExecutionState . |
CompletableFuture<TaskManagerLocation> |
getTaskManagerLocationFuture() |
JobManagerTaskRestore |
getTaskRestore() |
CompletableFuture<ExecutionState> |
getTerminalStateFuture()
Gets a future that completes once the task execution reaches a terminal state.
|
Map<String,Accumulator<?,?>> |
getUserAccumulators() |
StringifiedAccumulatorResult[] |
getUserAccumulatorsStringified()
Returns the user-defined accumulators as strings.
|
ExecutionVertex |
getVertex() |
String |
getVertexWithAttempt() |
boolean |
isFinished() |
void |
markFailed(Throwable t)
This method marks the task as failed, but will make no attempt to remove task execution from
the task manager.
|
void |
markFinished() |
void |
notifyCheckpointAborted(long abortCheckpointId,
long latestCompletedCheckpointId,
long timestamp)
Notify the task of this execution about a aborted checkpoint.
|
void |
notifyCheckpointOnComplete(long completedCheckpointId,
long completedTimestamp,
long lastSubsumedCheckpointId)
Notify the task of this execution about a completed checkpoint and the last subsumed
checkpoint id if possible.
|
void |
recoverExecution(ExecutionAttemptID attemptId,
TaskManagerLocation location,
Map<String,Accumulator<?,?>> userAccumulators,
IOMetrics metrics)
Recover the execution attempt status after JM failover.
|
void |
recoverProducedPartitions(Map<IntermediateResultPartitionID,ResultPartitionDeploymentDescriptor> producedPartitions) |
CompletableFuture<Void> |
registerProducedPartitions(TaskManagerLocation location) |
CompletableFuture<Acknowledge> |
sendOperatorEvent(OperatorID operatorId,
SerializedValue<OperatorEvent> event)
Sends the operator event to the Task on the Task Executor.
|
void |
setAccumulators(Map<String,Accumulator<?,?>> userAccumulators)
Update accumulators (discarded when the Execution has already been terminated).
|
void |
setInitialState(JobManagerTaskRestore taskRestore)
Sets the initial state for the execution.
|
CompletableFuture<?> |
suspend() |
String |
toString() |
void |
transitionState(ExecutionState targetState) |
CompletableFuture<Acknowledge> |
triggerCheckpoint(long checkpointId,
long timestamp,
CheckpointOptions checkpointOptions)
Trigger a new checkpoint on the task of this execution.
|
CompletableFuture<Acknowledge> |
triggerSynchronousSavepoint(long checkpointId,
long timestamp,
CheckpointOptions checkpointOptions)
Trigger a new checkpoint on the task of this execution.
|
boolean |
tryAssignResource(LogicalSlot logicalSlot)
Tries to assign the given slot to the execution.
|
public Execution(Executor executor, ExecutionVertex vertex, int attemptNumber, long startTimestamp, Duration rpcTimeout)
executor
- The executor used to dispatch callbacks from futures and asynchronous RPC
calls.vertex
- The execution vertex to which this Execution belongsattemptNumber
- The execution attempt number.startTimestamp
- The timestamp that marks the creation of this ExecutionrpcTimeout
- The rpcTimeout for RPC calls like deploy/cancel/stop.public ExecutionVertex getVertex()
public ExecutionAttemptID getAttemptId()
AccessExecution
ExecutionAttemptID
for this Execution.getAttemptId
in interface AccessExecution
public int getAttemptNumber()
AccessExecution
getAttemptNumber
in interface AccessExecution
public ExecutionState getState()
AccessExecution
ExecutionState
for this execution.getState
in interface AccessExecution
@Nullable public AllocationID getAssignedAllocationID()
public CompletableFuture<TaskManagerLocation> getTaskManagerLocationFuture()
public LogicalSlot getAssignedResource()
public Optional<ResultPartitionDeploymentDescriptor> getResultPartitionDeploymentDescriptor(IntermediateResultPartitionID id)
public boolean tryAssignResource(LogicalSlot logicalSlot)
logicalSlot
- to assign to this executionpublic Optional<InputSplit> getNextInputSplit()
public TaskManagerLocation getAssignedResourceLocation()
AccessExecution
TaskManagerLocation
for this execution.getAssignedResourceLocation
in interface AccessExecution
public Optional<ErrorInfo> getFailureInfo()
AccessExecution
getFailureInfo
in interface AccessExecution
Optional
of ErrorInfo
containing the Throwable
and the
time it was registered if an error occurred. If no error occurred an empty Optional
will be returned.public long[] getStateTimestamps()
AccessExecution
ExecutionState
.getStateTimestamps
in interface AccessExecution
public long[] getStateEndTimestamps()
AccessExecution
ExecutionState
.getStateEndTimestamps
in interface AccessExecution
public long getStateTimestamp(ExecutionState state)
AccessExecution
ExecutionState
.getStateTimestamp
in interface AccessExecution
state
- state for which the timestamp should be returnedpublic long getStateEndTimestamp(ExecutionState state)
AccessExecution
ExecutionState
.getStateEndTimestamp
in interface AccessExecution
state
- state for which the timestamp should be returnedpublic boolean isFinished()
@Nullable public JobManagerTaskRestore getTaskRestore()
public void setInitialState(JobManagerTaskRestore taskRestore)
TaskDeploymentDescriptor
to the TaskManagers.taskRestore
- information to restore the statepublic CompletableFuture<?> getInitializingOrRunningFuture()
ExecutionState.INITIALIZING
or ExecutionState.RUNNING
. If this task never reaches
these states (for example because the task is cancelled before it was properly deployed and
restored), then this future will never complete.
The future is completed already in the ExecutionState.INITIALIZING
state, because
various running actions are already possible in that state (the task already accepts and
sends events and network data for task recovery). (Note that in earlier versions, the
INITIALIZING state was not separate but part of the RUNNING state).
This future is always completed from the job master's main thread.
public CompletableFuture<ExecutionState> getTerminalStateFuture()
getTerminalStateFuture
in interface LogicalSlot.Payload
public CompletableFuture<?> getReleaseFuture()
public CompletableFuture<Void> registerProducedPartitions(TaskManagerLocation location)
public void recoverExecution(ExecutionAttemptID attemptId, TaskManagerLocation location, Map<String,Accumulator<?,?>> userAccumulators, IOMetrics metrics)
public void recoverProducedPartitions(Map<IntermediateResultPartitionID,ResultPartitionDeploymentDescriptor> producedPartitions)
public static ResultPartitionDeploymentDescriptor createResultPartitionDeploymentDescriptor(IntermediateResultPartition partition, ShuffleDescriptor shuffleDescriptor)
public void deploy() throws JobException
JobException
- if the execution cannot be deployed to the assigned resourcepublic void cancel()
public CompletableFuture<?> suspend()
public void fail(Throwable t)
fail
in interface LogicalSlot.Payload
t
- The exception that caused the task to fail.public void notifyCheckpointOnComplete(long completedCheckpointId, long completedTimestamp, long lastSubsumedCheckpointId)
completedCheckpointId
- of the completed checkpointcompletedTimestamp
- of the completed checkpointlastSubsumedCheckpointId
- of the last subsumed checkpoint, a value of CheckpointStoreUtil.INVALID_CHECKPOINT_ID
means no
checkpoint has been subsumed.public void notifyCheckpointAborted(long abortCheckpointId, long latestCompletedCheckpointId, long timestamp)
abortCheckpointId
- of the subsumed checkpointlatestCompletedCheckpointId
- of the latest completed checkpointtimestamp
- of the subsumed checkpointpublic CompletableFuture<Acknowledge> triggerCheckpoint(long checkpointId, long timestamp, CheckpointOptions checkpointOptions)
checkpointId
- of th checkpoint to triggertimestamp
- of the checkpoint to triggercheckpointOptions
- of the checkpoint to triggerpublic CompletableFuture<Acknowledge> triggerSynchronousSavepoint(long checkpointId, long timestamp, CheckpointOptions checkpointOptions)
checkpointId
- of th checkpoint to triggertimestamp
- of the checkpoint to triggercheckpointOptions
- of the checkpoint to triggerpublic CompletableFuture<Acknowledge> sendOperatorEvent(OperatorID operatorId, SerializedValue<OperatorEvent> event)
public void markFailed(Throwable t)
t
- The exception that caused the task to fail.@VisibleForTesting public void markFinished()
public void transitionState(ExecutionState targetState)
public String getVertexWithAttempt()
public void setAccumulators(Map<String,Accumulator<?,?>> userAccumulators)
userAccumulators
- the user accumulatorspublic Map<String,Accumulator<?,?>> getUserAccumulators()
public StringifiedAccumulatorResult[] getUserAccumulatorsStringified()
AccessExecution
getUserAccumulatorsStringified
in interface AccessExecution
public int getParallelSubtaskIndex()
AccessExecution
getParallelSubtaskIndex
in interface AccessExecution
public IOMetrics getIOMetrics()
getIOMetrics
in interface AccessExecution
public ArchivedExecution archive()
archive
in interface Archiveable<ArchivedExecution>
Copyright © 2014–2024 The Apache Software Foundation. All rights reserved.