Operator Health Monitoring

Health Probe

The Flink Kubernetes Operator provides a built-in health endpoint that serves as the information source for Kubernetes liveness and startup probes.

The liveness and startup probes are enabled by default in the Helm chart:

operatorHealth:
  port: 8085
  livenessProbe:
    periodSeconds: 10
    initialDelaySeconds: 30
  startupProbe:
    failureThreshold: 30
    periodSeconds: 10

The health endpoint catches startup and informer errors exposed by the Java Operator SDK (JOSDK) framework. By default, if one of the watched namespaces becomes inaccessible, the health endpoint reports an error and the operator restarts.

In some cases it is desirable to keep the operator running even if some namespaces are inaccessible. To allow the operator to start even when some namespaces cannot be watched, disable the kubernetes.operator.startup.stop-on-informer-error flag.
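As a minimal sketch, disabling the flag amounts to setting it to false in the operator configuration. Where this line goes depends on how the operator is deployed (for example, the default configuration section of the Helm values), so the placement is an assumption here:

```yaml
# Operator configuration fragment (placement depends on your deployment).
# With this set to false, the operator can start even if some watched
# namespaces are inaccessible.
kubernetes.operator.startup.stop-on-informer-error: false
```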

Canary Resources

The canary resource feature allows users to deploy special dummy resources (canaries) into selected namespaces. The operator health probe then checks that these resources are reconciled in a timely manner. This allows the health probe to catch slowdowns and other general reconciliation issues that are not covered otherwise.

Canary deployments are identified by a special label: "flink.apache.org/canary": "true". These resources do not need to define a spec, and they do not start any pods or consume other cluster resources. They exist purely to verify that operator reconciliation is working.

Canary FlinkDeployment:

apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: canary
  labels:
    "flink.apache.org/canary": "true"

The default timeout for reconciling the canary resources is 1 minute, controlled by kubernetes.operator.health.canary.resource.timeout. If the operator cannot reconcile the canaries within this time limit, it is marked unhealthy and automatically restarted.
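For illustration, the timeout could be raised in the operator configuration as sketched below. The value of 2 minutes is arbitrary, and the duration notation is assumed to follow the usual Flink duration format, so check it against your operator version:

```yaml
# Operator configuration fragment (illustrative value, not a recommendation).
# Allows canaries up to 2 minutes to reconcile before the operator
# is marked unhealthy.
kubernetes.operator.health.canary.resource.timeout: 2 min
```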

Canaries can be deployed into multiple namespaces.
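A sketch of what this might look like, using the canary label described earlier. The namespace names team-a and team-b are hypothetical, and each is assumed to be a namespace the operator watches:

```yaml
# One canary per watched namespace; the same name can be reused
# because the resources live in different namespaces.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: canary
  namespace: team-a   # hypothetical namespace
  labels:
    "flink.apache.org/canary": "true"
---
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: canary
  namespace: team-b   # hypothetical namespace
  labels:
    "flink.apache.org/canary": "true"
```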