Overview #

Flink Kubernetes Operator acts as a control plane to manage the complete deployment lifecycle of Apache Flink applications. Although Flink’s native Kubernetes integration already allows you to deploy Flink applications directly on a running Kubernetes (k8s) cluster, custom resources and the operator pattern have also become central to a Kubernetes-native deployment experience.

Flink Kubernetes Operator aims to capture the responsibilities of a human operator who is managing Flink deployments. Human operators have deep knowledge of how Flink deployments ought to behave, how to start clusters, how to deploy jobs, how to upgrade them and how to react if there are problems. The main goal of the operator is the automation of these activities, which cannot be achieved through the Flink native integration alone.

Features #

Core #

  • Fully-automated Job Lifecycle Management
    • Running, suspending and deleting applications
    • Stateful and stateless application upgrades
    • Triggering and managing savepoints
    • Handling errors, rolling back broken upgrades
  • Multiple Flink version support: v1.13, v1.14, v1.15
  • Deployment Modes:
    • Application cluster
    • Session cluster
    • Session job
  • Built-in High Availability
  • Extensible validation framework
  • Advanced Configuration management
    • Default configurations with dynamic updates
    • Per job configuration
    • Environment variables
  • Pod augmentation via Pod Templates
    • Native Kubernetes Pod definitions
    • Layering (Base/JobManager/TaskManager overrides)

Operations #

Known Issues & Limitations #

JobManager High-availability #

The Operator leverages Kubernetes HA Services to provide high availability for Flink jobs. The HA solution can benefit from additional standby replicas, which result in a faster recovery time, but Flink jobs will still restart when the leader JobManager goes down.

Standalone Kubernetes Support #

The Operator does not yet support Standalone Kubernetes deployments.

JobResultStore Resource Leak #

To mitigate the impact of FLINK-27569, the operator introduced a workaround (FLINK-27573): it sets job-result-store.delete-on-commit=false and assigns a unique value to job-result-store.storage-path for every cluster launch. The storage paths of older runs must be cleaned up manually, always retaining the latest directory:

ls -lth /tmp/flink/ha/job-result-store/basic-checkpoint-ha-example/
total 0
drwxr-xr-x 2 9999 9999 40 May 12 09:51 119e0203-c3a9-4121-9a60-d58839576f01 <- must be retained
drwxr-xr-x 2 9999 9999 60 May 12 09:46 a6031ec7-ab3e-4b30-ba77-6498e58e6b7f
drwxr-xr-x 2 9999 9999 60 May 11 15:11 b6fb2a9c-d1cd-4e65-a9a1-e825c4b47543
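
The manual cleanup described above could be scripted along the following lines. This is a minimal sketch, not something the operator ships: the function name prune_job_result_store and its dry-run/delete argument are hypothetical. It also assumes that the most recently modified subdirectory is the one belonging to the latest run, as in the listing above. It defaults to a dry run, so nothing is deleted unless "delete" is passed explicitly.

```shell
#!/bin/sh
# Hypothetical helper (not part of the operator): remove every run directory
# under a job-result-store path except the most recently modified one.
# Assumption: the newest directory by modification time is the latest run.
prune_job_result_store() {
    base_dir=$1
    mode=${2:-dry-run}   # pass "delete" to actually remove directories

    # List subdirectories newest first, then skip the first (newest) entry.
    # Run directory names are UUIDs, so line-based parsing is safe here.
    ls -1td "$base_dir"/*/ 2>/dev/null | tail -n +2 | while IFS= read -r dir; do
        if [ "$mode" = "delete" ]; then
            rm -rf -- "$dir"
            echo "removed $dir"
        else
            echo "would remove $dir"
        fi
    done
}
```

For the example above, running `prune_job_result_store /tmp/flink/ha/job-result-store/basic-checkpoint-ha-example` would only print the two older directories; appending `delete` removes them. It is worth reviewing the dry-run output first, since directory modification times may not always reflect launch order.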