Overview #

Flink Kubernetes Operator acts as a control plane to manage the complete deployment lifecycle of Apache Flink applications. Although Flink’s native Kubernetes integration already allows you to directly deploy Flink applications on a running Kubernetes(k8s) cluster, custom resources and the operator pattern have also become central to a Kubernetes native deployment experience.

Flink Kubernetes Operator aims to capture the responsibilities of a human operator who is managing Flink deployments. Human operators have deep knowledge of how Flink deployments ought to behave, how to start clusters, how to deploy jobs, how to upgrade them and how to react if there are problems. The main goal of the operator is the automation of these activities, which cannot be achieved through the Flink native integration alone.

Features #

Core #

Fully-automated Job Lifecycle Management
- Running, suspending and deleting applications
- Stateful and stateless application upgrades
- Triggering and managing savepoints
- Handling errors, rolling-back broken upgrades
Multiple Flink version support: v1.16, v1.17, v1.18, v1.19, v1.20
Deployment Modes:
- Application cluster
- Session cluster
- Session job
Built-in High Availability
Extensible framework
- Custom validators
- Custom resource listeners
Advanced Configuration management
- Default configurations with dynamic updates
- Per job configuration
- Environment variables
POD augmentation via Pod Templates
- Native Kubernetes POD definitions
- Layering (Base/JobManager/TaskManager overrides)
Job Autoscaler
- Collect lag and utilization metrics
- Scale job vertices to the ideal parallelism
- Scale up and down as the load changes
Snapshot management
- Manage snapshots via Kubernetes CRs

Operations #

Operator Metrics
- Utilizes the well-established Flink Metric System
- Pluggable metrics reporters
- Detailed resources and kubernetes api access metrics
Fully-customizable Logging
- Default log configuration
- Per job log configuration
- Sidecar based log forwarders
Flink Web UI and REST Endpoint Access
- Fully supported Flink Native Kubernetes service expose types
- Dynamic Ingress templates
Helm based installation
- Automated RBAC configuration
- Advanced customization techniques
Up-to-date public repositories
- GitHub Container Registry ghcr.io/apache/flink-kubernetes-operator
- DockerHub https://hub.docker.com/r/apache/flink-kubernetes-operator

Built-in Examples #

The operator project comes with a wide variety of built in examples to show you how to use the operator functionality. The examples are maintained as part of the operator repo and can be found here.

What is covered:

Application, Session and SessionJob submission
Checkpointing and HA configuration
Java, SQL and Python Flink jobs
Ingress, logging and metrics configuration
Advanced operator deployment techniques using Kustomize
And some more…

Known Issues & Limitations #

JobManager High-availability #

The Operator supports both Kubernetes HA Services and Zookeeper HA Services for providing High-availability for Flink jobs. The HA solution can benefit form using additional Standby replicas, it will result in a faster recovery time, but Flink jobs will still restart when the Leader JobManager goes down.

JobResultStore Resource Leak #

To mitigate the impact of FLINK-27569 the operator introduced a workaround FLINK-27573 by setting job-result-store.delete-on-commit=false and a unique value for job-result-store.storage-path for every cluster launch. The storage path for older runs must be cleaned up manually, keeping the latest directory always:

ls -lth /tmp/flink/ha/job-result-store/basic-checkpoint-ha-example/
total 0
drwxr-xr-x 2 9999 9999 40 May 12 09:51 119e0203-c3a9-4121-9a60-d58839576f01 <- must be retained
drwxr-xr-x 2 9999 9999 60 May 12 09:46 a6031ec7-ab3e-4b30-ba77-6498e58e6b7f
drwxr-xr-x 2 9999 9999 60 May 11 15:11 b6fb2a9c-d1cd-4e65-a9a1-e825c4b47543

AuditUtils can log sensitive information present in the custom resources #

As reported in FLINK-30306 when Flink custom resources change the operator logs the change, which could include sensitive information. We suggest ingesting secrets to Flink containers during runtime to mitigate this. Also note that anyone who has access to the custom resources already had access to the potentially sensitive information in question, but folks who only have access to the logs could also see them now. We are planning to introduce redaction rules to AuditUtils to improve this in a later release.