Class CheckpointFailureManager


  • public class CheckpointFailureManager
    extends Object
    The checkpoint failure manager, which centralizes the checkpoint failure processing logic.
    • Field Detail

      • UNLIMITED_TOLERABLE_FAILURE_NUMBER

        public static final int UNLIMITED_TOLERABLE_FAILURE_NUMBER
        See Also:
        Constant Field Values
      • EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE

        public static final String EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE
        See Also:
        Constant Field Values
    • Method Detail

      • handleCheckpointException

        public void handleCheckpointException(@Nullable
                                              PendingCheckpoint pendingCheckpoint,
                                              CheckpointProperties checkpointProperties,
                                              CheckpointException exception,
                                              @Nullable
                                              ExecutionAttemptID executionAttemptID,
                                              JobID job,
                                              @Nullable
                                              PendingCheckpointStats pendingCheckpointStats,
                                              CheckpointStatsTracker statsTracker)
        Failures on JM:
        • all checkpoints - count against the failure counter.
        • any savepoints - do nothing; savepoints are a manual action, and a failover would not help anyway.

        Failures on TM:

        • all checkpoints - count against the failure counter (a failover might help, and we want to notify users).
        • sync savepoints - we must always fail; otherwise we risk a deadlock in which job cancellation waits for a savepoint that never finishes.
        • non-sync savepoints - count against the failure counter (a failover might help solve the problem).
        Parameters:
        pendingCheckpoint - the failed checkpoint if it was initialized already.
        checkpointProperties - the checkpoint properties, used to determine which handling strategy applies.
        exception - the checkpoint exception.
        executionAttemptID - the execution attempt ID, used as a safeguard.
        job - the JobID.
        pendingCheckpointStats - the pending checkpoint statistics.
        statsTracker - the tracker for checkpoint statistics.
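        The rules listed above for handleCheckpointException can be read as a small decision table. The following is a minimal sketch of that table only, not the class's actual implementation: the class FailureRoutingSketch, its enums, and the decide method are hypothetical names introduced for illustration, and "we must always fail" is interpreted here as failing the job.

```java
import java.util.Objects;

/** Hypothetical illustration of the documented rules; not Flink's implementation. */
final class FailureRoutingSketch {

    /** Where the failure was observed (assumed names). */
    enum Origin { JOB_MANAGER, TASK_MANAGER }

    /** The kind of snapshot that failed (assumed names). */
    enum SnapshotKind { CHECKPOINT, SYNC_SAVEPOINT, NON_SYNC_SAVEPOINT }

    /** What the manager does with the failure (assumed names). */
    enum Action { COUNT_FAILURE, IGNORE, FAIL_JOB }

    static Action decide(Origin origin, SnapshotKind kind) {
        Objects.requireNonNull(origin, "origin");
        Objects.requireNonNull(kind, "kind");

        // Checkpoints count against the failure counter on both JM and TM.
        if (kind == SnapshotKind.CHECKPOINT) {
            return Action.COUNT_FAILURE;
        }
        // Savepoint failures on the JM are ignored: a failover would not help.
        if (origin == Origin.JOB_MANAGER) {
            return Action.IGNORE;
        }
        // On the TM, a sync savepoint must always fail to avoid a deadlock;
        // a non-sync savepoint counts against the failure counter.
        return kind == SnapshotKind.SYNC_SAVEPOINT
                ? Action.FAIL_JOB
                : Action.COUNT_FAILURE;
    }
}
```

        For example, FailureRoutingSketch.decide(Origin.TASK_MANAGER, SnapshotKind.SYNC_SAVEPOINT) yields FAIL_JOB, while any savepoint failure reported with Origin.JOB_MANAGER yields IGNORE.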
      • checkFailureCounter

        public void checkFailureCounter(CheckpointException exception,
                                        long checkpointId)
      • handleCheckpointSuccess

        public void handleCheckpointSuccess(long checkpointId)
        Handle checkpoint success.
        Parameters:
        checkpointId - the ID of the successful checkpoint, used to count consecutive failures based on the checkpoint ID sequence.
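        To illustrate how consecutive failures could be counted against the checkpoint ID sequence, here is a minimal sketch under stated assumptions. It is not the class's actual code: ContinuousFailureCounterSketch and its methods are hypothetical, the use of Integer.MAX_VALUE as the "unlimited" sentinel is an assumption (this page does not state the constant's value), and ignoring failures of checkpoints older than the last success is one plausible reading of "based on checkpoint id sequence".

```java
/** Hypothetical illustration of consecutive-failure counting; not Flink's implementation. */
final class ContinuousFailureCounterSketch {

    /** Assumed sentinel meaning "tolerate any number of failures". */
    static final int UNLIMITED = Integer.MAX_VALUE;

    private final int tolerableFailures;
    private int continuousFailures = 0;
    private long lastSucceededCheckpointId = Long.MIN_VALUE;

    ContinuousFailureCounterSketch(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /**
     * Records a failed checkpoint.
     *
     * @return true if the number of consecutive failures now exceeds the tolerance
     */
    boolean onFailure(long checkpointId) {
        // A failure of a checkpoint older than the last success is stale and not counted.
        if (checkpointId > lastSucceededCheckpointId) {
            continuousFailures++;
        }
        return tolerableFailures != UNLIMITED && continuousFailures > tolerableFailures;
    }

    /** Records a successful checkpoint and resets the consecutive-failure count. */
    void onSuccess(long checkpointId) {
        // Only a success newer than the last recorded one resets the count.
        if (checkpointId > lastSucceededCheckpointId) {
            lastSucceededCheckpointId = checkpointId;
            continuousFailures = 0;
        }
    }
}
```

        With a tolerance of 1, the sequence fail(1), success(2), fail(3), fail(4) exceeds the tolerance only at checkpoint 4, because the success at checkpoint 2 resets the count.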