Overview #
The Flink Kubernetes Operator acts as a control plane that manages the complete deployment lifecycle of Apache Flink applications. Although Flink's native Kubernetes integration already lets you deploy Flink applications directly on a running Kubernetes (k8s) cluster, custom resources and the operator pattern have also become central to a Kubernetes-native deployment experience.
The Flink Kubernetes Operator aims to capture the responsibilities of a human operator who manages Flink deployments. Human operators have deep knowledge of how Flink deployments ought to behave, how to start clusters, how to deploy jobs, how to upgrade them, and how to react when problems occur. The main goal of the operator is to automate these activities, which cannot be achieved through the Flink native integration alone.
Features #
Core #
- Fully-automated Job Lifecycle Management
  - Running, suspending and deleting applications
  - Stateful and stateless application upgrades
  - Triggering and managing savepoints
  - Handling errors, rolling back broken upgrades
- Multiple Flink version support: v1.13, v1.14, v1.15
- Deployment Modes:
  - Application cluster
  - Session cluster
  - Session job
- Built-in High Availability
- Extensible validation framework
  - Webhook and Operator based validation
  - Custom validators
- Advanced Configuration management
  - Default configurations with dynamic updates
  - Per job configuration
  - Environment variables
- Pod augmentation via Pod Templates
  - Native Kubernetes Pod definitions
  - Layering (Base/JobManager/TaskManager overrides)
Operations #
- Operator Metrics
  - Utilizes the well-established Flink Metric System
  - Pluggable metrics reporters
- Fully-customizable Logging
  - Default log configuration
  - Per job log configuration
  - Sidecar based log forwarders
- Flink Web UI and REST Endpoint Access
  - Fully supported Flink Native Kubernetes service expose types
  - Dynamic Ingress templates
- Helm based installation
  - Automated RBAC configuration
  - Advanced customization techniques
  - Up-to-date public repositories
Known Issues & Limitations #
JobManager High-availability #
The Operator leverages Kubernetes HA Services to provide high availability for Flink jobs. The HA solution can benefit from additional standby JobManager replicas, which shorten recovery time, but Flink jobs will still restart when the leader JobManager goes down.
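As a rough illustration of the standby-replica setup described above, the following sketch writes a FlinkDeployment manifest with two JobManager replicas and Kubernetes HA enabled. The file name, the storage paths under /flink-data, and the resource sizes are assumptions chosen for illustration. The volume that would make /flink-data available to every pod is omitted, so treat this as a starting point rather than a complete deployment.

```shell
#!/usr/bin/env bash
# Sketch: write a FlinkDeployment manifest that runs one standby JobManager.
# MANIFEST, the /flink-data paths and the resource sizes are illustrative
# assumptions; the HA storage directory must be reachable from every
# JobManager pod, and the volume setup for it is not shown here.
MANIFEST="${MANIFEST:-flink-ha-example.yaml}"

cat > "$MANIFEST" <<'EOF'
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: basic-checkpoint-ha-example
spec:
  image: flink:1.15
  flinkVersion: v1_15
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
    state.checkpoints.dir: file:///flink-data/checkpoints
    state.savepoints.dir: file:///flink-data/savepoints
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: file:///flink-data/ha
  serviceAccount: flink
  jobManager:
    replicas: 2          # one leader plus one standby
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 2
    upgradeMode: savepoint
    state: running
EOF

echo "wrote $MANIFEST"
```

The generated file could then be submitted with `kubectl apply -f flink-ha-example.yaml`. Even with the standby replica in place, a leader failure still restarts the job; the standby only shortens the time until a new leader takes over.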
Standalone Kubernetes Support #
The Operator does not yet support Standalone Kubernetes deployments.
JobResultStore Resource Leak #
To mitigate the impact of FLINK-27569, the operator introduced a workaround (FLINK-27573) that sets job-result-store.delete-on-commit=false and a unique value for job-result-store.storage-path for every cluster launch. The storage path for older runs must be cleaned up manually, always keeping the latest directory:
ls -lth /tmp/flink/ha/job-result-store/basic-checkpoint-ha-example/
total 0
drwxr-xr-x 2 9999 9999 40 May 12 09:51 119e0203-c3a9-4121-9a60-d58839576f01 <- must be retained
drwxr-xr-x 2 9999 9999 60 May 12 09:46 a6031ec7-ab3e-4b30-ba77-6498e58e6b7f
drwxr-xr-x 2 9999 9999 60 May 11 15:11 b6fb2a9c-d1cd-4e65-a9a1-e825c4b47543
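The manual cleanup could be scripted along the following lines. This is a minimal sketch, not something the operator ships: the function name `prune_job_result_store` and the `DRY_RUN` variable are hypothetical, and it assumes that the "latest" directory is the one with the most recent modification time, which matches the ordering of the `ls -lth` listing above. It defaults to a dry run so that nothing is deleted until the output has been checked.

```shell
#!/usr/bin/env bash
# Sketch: keep only the most recently modified run directory under a job
# result store path. Function name and DRY_RUN are hypothetical helpers.
prune_job_result_store() {
  local base_dir="$1"
  local newest=""
  local dir

  # First pass: find the most recently modified subdirectory.
  for dir in "$base_dir"/*/; do
    [ -d "$dir" ] || continue
    if [ -z "$newest" ] || [ "$dir" -nt "$newest" ]; then
      newest="$dir"
    fi
  done

  # Second pass: remove (or just report) every other subdirectory.
  for dir in "$base_dir"/*/; do
    [ -d "$dir" ] || continue
    [ "$dir" = "$newest" ] && continue
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would remove: $dir"
    else
      rm -rf -- "$dir"
    fi
  done
}
```

Calling `prune_job_result_store /tmp/flink/ha/job-result-store/basic-checkpoint-ha-example` lists the directories that would be removed; prefixing the call with `DRY_RUN=0` performs the deletion. Run it only while no new cluster launch is in progress, since a freshly created directory would otherwise change which one counts as the latest.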