Operator Upgrade Process #
This page details the process of upgrading the operator to a new version.
Please check the compatibility page for the complete overview of the backward compatibility guarantees before upgrading to new versions.
Upgrading from the preview/experimental v1alpha1 release to v1beta1 requires a one-time manual process. Please check the related section below.
Normal Upgrade Process #
Normally, upgrading the operator to a new release or development version consists of the following two steps:
- Upgrading the CRDs
- Upgrading the Helm deployment
We will cover these steps in detail in the next sections.
1. Upgrading the CRD #
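The first step of the upgrade process is upgrading the CRDs for the FlinkDeployment and FlinkSessionJob resources.
This step must be completed manually and is not part of the Helm installation logic, because Helm installs the CRDs bundled with a chart only when they are not yet present and does not upgrade them afterwards.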
kubectl replace -f helm/flink-kubernetes-operator/crds/flinkdeployments.flink.apache.org-v1.yml
kubectl replace -f helm/flink-kubernetes-operator/crds/flinksessionjobs.flink.apache.org-v1.yml
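Please note that we are using the replace command here, which updates the existing CRDs in place and ensures that running deployments are unaffected. Deleting and recreating a CRD instead would also delete every resource of that kind.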
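The two replace commands above can be wrapped in a small shell function, for example to reuse them in a CI job. This is a minimal sketch, assuming a POSIX-compatible shell (sh, bash or dash) and a working directory at the root of the flink-kubernetes-operator source tree; the function name upgrade_flink_crds is hypothetical and not part of the operator tooling.

```shell
#!/bin/sh
# Hypothetical helper: replaces the FlinkDeployment and FlinkSessionJob CRDs in place.
# Assumes it runs from the root of the flink-kubernetes-operator source tree,
# unless another CRD directory is passed as the first argument.
upgrade_flink_crds() {
  local crd_dir="${1:-helm/flink-kubernetes-operator/crds}"
  local crd
  for crd in flinkdeployments.flink.apache.org-v1.yml \
             flinksessionjobs.flink.apache.org-v1.yml; do
    # "replace" updates the existing CRD object instead of deleting it,
    # so existing custom resources stay in place; stop at the first failure.
    kubectl replace -f "${crd_dir}/${crd}" || return 1
  done
}

# Usage:
#   upgrade_flink_crds
```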
2. Upgrading the Helm deployment #
Once the new CRD versions are installed, we can upgrade the Helm deployment.
# Uninstall running Helm deployment
helm uninstall flink-kubernetes-operator
helm install ...
The exact installation command depends on your current environment and settings. Please see the helm page for details.
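The uninstall and reinstall can be sketched as a shell function that only uninstalls when a release is actually present. This assumes a POSIX-compatible shell and the release name flink-kubernetes-operator used above; the function name reinstall_flink_operator is hypothetical, and the chart reference plus any extra helm install flags are passed through as arguments, since the exact command depends on your environment.

```shell
#!/bin/sh
# Hypothetical helper: removes the running operator release (if any) and installs it again.
# $1 is the chart reference; any further arguments are forwarded to "helm install".
reinstall_flink_operator() {
  local release="flink-kubernetes-operator"
  local chart="$1"
  shift
  # "helm status" fails when the release does not exist, so uninstall only if present.
  if helm status "$release" >/dev/null 2>&1; then
    helm uninstall "$release" || return 1
  fi
  helm install "$release" "$chart" "$@"
}

# Usage (chart reference as in the v1beta1 section below; adjust to your setup):
#   reinstall_flink_operator flink-operator-repo/flink-kubernetes-operator
```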
Upgrading from v1alpha1 -> v1beta1 #
The first stable v1beta1 release introduced some breaking changes on the operator side compared to the preview (v1alpha1) release.
These changes require a one-time manual upgrade process for running jobs.
Upgrading without existing FlinkDeployments #
In an environment without any FlinkDeployments you simply need to uninstall the operator and delete the v1alpha1 CRD.
helm uninstall flink-kubernetes-operator
kubectl delete crd flinkdeployments.flink.apache.org
# Now simply reinstall the operator with the new v1beta1 version
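Because deleting a CRD also deletes every resource of that kind, it can be worth confirming that no FlinkDeployments remain before running the commands above. The following is a minimal sketch, assuming a POSIX-compatible shell and kubectl access to all namespaces; the function name no_flink_deployments_left is hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-check: succeeds only if no FlinkDeployment resources exist
# in any namespace, and fails if kubectl itself fails.
no_flink_deployments_left() {
  local remaining
  remaining=$(kubectl get flinkdeployments.flink.apache.org --all-namespaces -o name) || return 1
  if [ -n "$remaining" ]; then
    # Report what is still there on stderr and refuse to continue.
    echo "FlinkDeployments still present:" >&2
    echo "$remaining" >&2
    return 1
  fi
}

# Usage:
#   no_flink_deployments_left && helm uninstall flink-kubernetes-operator \
#     && kubectl delete crd flinkdeployments.flink.apache.org
```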
Upgrading with existing FlinkDeployments #
The following steps demonstrate the CRD upgrade process from v1alpha1 to v1beta1 in an environment with an existing stateful job with an old v1alpha1 apiVersion. After the CRD upgrade, the job will be resumed from the savepoint.
- Suspend the job and create a savepoint:
kubectl patch flinkdeployment/basic-checkpoint-ha-example --type=merge -p '{"spec": {"job": {"state": "suspended", "upgradeMode": "savepoint"}}}'
Verify that deploy/basic-checkpoint-ha-example has terminated and that flinkdeployment/basic-checkpoint-ha-example has a Last Savepoint Location similar to file:/flink-data/savepoints/savepoint-000000-aec3dd08e76d/_metadata. This file will be used to restore the job. See stateful and stateless application upgrade for more details.
- Delete the job:
kubectl delete flinkdeployment/basic-checkpoint-ha-example
- Uninstall the flink-kubernetes-operator Helm chart and the CRD with the old v1alpha1 version:
helm uninstall flink-kubernetes-operator
kubectl delete crd flinkdeployments.flink.apache.org
- Reinstall the flink-kubernetes-operator Helm chart with the v1beta1 CRD:
helm repo update flink-operator-repo
helm install flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator
Verify that the deploy/flink-kubernetes-operator log contains:
2022-04-13 06:09:40,761 i.j.o.Operator [INFO ] Registered reconciler: 'flinkdeploymentcontroller' for resource: 'class org.apache.flink.kubernetes.operator.crd.FlinkDeployment' for namespace(s): [all namespaces]
2022-04-13 06:09:40,943 i.f.k.c.i.VersionUsageUtils [WARN ] The client is using resource type 'flinksessionjobs' with unstable version 'v1beta1'
2022-04-13 06:09:41,461 i.j.o.Operator [INFO ] Registered reconciler: 'flinksessionjobcontroller' for resource: 'class org.apache.flink.kubernetes.operator.crd.FlinkSessionJob' for namespace(s): [all namespaces]
2022-04-13 06:09:41,464 i.j.o.Operator [INFO ] Operator SDK 2.1.2 (commit: a3a81ef) built on 2022-03-15T09:59:42.000+0000 starting...
2022-04-13 06:09:41,464 i.j.o.Operator [INFO ] Client version: 5.12.1
2022-04-13 06:09:41,499 i.f.k.c.i.VersionUsageUtils [WARN ] The client is using resource type 'flinkdeployments' with unstable version 'v1beta1'
- Restore the job:
Deploy the previously deleted job using this FlinkDeployment with v1beta1 and explicitly set job.initialSavepointPath to the savepoint location obtained in the first step:
spec:
  ...
  job:
    initialSavepointPath: /flink-data/savepoints/savepoint-000000-aec3dd08e76d/_metadata
    ...
Alternatively, we may use the following command, which relies on the yq v3 write syntax (yq w), to edit and deploy the manifest:
wget -qO - https://raw.githubusercontent.com/apache/flink-kubernetes-operator/main/examples/basic-checkpoint-ha.yaml | yq w - "spec.job.initialSavepointPath" "/flink-data/savepoints/savepoint-000000-aec3dd08e76d/_metadata" | kubectl apply -f -
Finally, verify that the deploy/basic-checkpoint-ha-example log contains a line similar to:
Starting job 00000000000000000000000000000000 from savepoint /flink-data/savepoints/savepoint-000000-2f40a9c8e4b9/_metadata
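The first step (suspending the job and reading back the savepoint location) can be sketched as a shell function. This is an illustrative helper, not part of the operator: the function name suspend_and_get_savepoint, the 5 second polling interval and the 60-attempt limit are arbitrary choices, and the JSONPath .status.jobStatus.savepointInfo.lastSavepoint.location is an assumption about the v1alpha1 status layout that you should check against kubectl get flinkdeployment/<name> -o yaml.

```shell
#!/bin/sh
# Hypothetical helper (name, polling interval and retry limit are illustrative):
# suspends a FlinkDeployment with a savepoint, waits until its JobManager
# deployment is gone, then prints the savepoint location recorded in the status.
suspend_and_get_savepoint() {
  local name="$1"
  local attempts=0
  kubectl patch "flinkdeployment/${name}" --type=merge \
    -p '{"spec": {"job": {"state": "suspended", "upgradeMode": "savepoint"}}}' || return 1

  # "kubectl get" exits non-zero once deploy/<name> no longer exists
  # (or on any other error); give up after 60 polls of 5 seconds.
  while kubectl get "deploy/${name}" >/dev/null 2>&1; do
    attempts=$((attempts + 1))
    if [ "$attempts" -ge 60 ]; then
      return 1
    fi
    sleep 5
  done

  # ASSUMPTION: status field path in v1alpha1; confirm it with
  # "kubectl get flinkdeployment/<name> -o yaml" before relying on it.
  kubectl get "flinkdeployment/${name}" \
    -o jsonpath='{.status.jobStatus.savepointInfo.lastSavepoint.location}'
}

# Usage:
#   suspend_and_get_savepoint basic-checkpoint-ha-example
```

The printed location is the value to use as job.initialSavepointPath when restoring the job.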
Changes of default values of FlinkDeployment #
Some default values of FlinkDeployment fields have changed or been improved in v1beta1:
- The default value of crd.spec.Resource#cpu is 1.0.
- The default value of crd.spec.JobManagerSpec#replicas is 1.
- crd.spec.FlinkDeploymentSpec#serviceAccount has no default value, so users must specify it explicitly.