Issue
During deployment or upgrade of Fusion 5.9.12 and above in a Kubernetes environment with multiple namespaces on a shared cluster, the process fails with errors indicating that the KubeRay Operator CustomResourceDefinition (CRD) and related ClusterRoles already exist and are owned by another namespace release. Example errors may include:

    CustomResourceDefinition "rayclusters.ray.io" ... exists and cannot be imported into the current release: invalid ownership metadata
    ClusterRole "rayjob-editor-role" ... exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" ...

Post-deployment, the kuberay-operator pod fails with RBAC permission errors such as:

    pods is forbidden: User "system:serviceaccount:<namespace>:kuberay-operator" cannot list resource "pods" in API group "" at the cluster scope

This issue arises when a second Fusion instance is deployed or upgraded in a Kubernetes cluster where an earlier instance was already deployed or upgraded with the KubeRay Operator. CRDs and ClusterRoles are cluster-scoped, so the ones created by the earlier deployment already exist and carry that release's Helm ownership metadata.
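To confirm which release currently owns the conflicting resources, you can read the Helm ownership annotations recorded on them. The following is a minimal sketch, assuming kubectl is configured for the shared cluster; the function name helm_owner is hypothetical, and the resource names are taken from the example errors above.

```shell
#!/bin/sh
# Print the Helm release name and namespace recorded on a cluster-scoped resource.
# helm_owner is an illustrative helper, not part of Fusion or Helm.
helm_owner() {
  kind="$1"
  name="$2"
  # Dots inside the annotation key must be escaped in kubectl's jsonpath syntax.
  rel=$(kubectl get "$kind" "$name" \
    -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-name}')
  ns=$(kubectl get "$kind" "$name" \
    -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-namespace}')
  echo "$kind/$name owned by release '$rel' in namespace '$ns'"
}

# Example calls (run against your own cluster):
#   helm_owner crd rayclusters.ray.io
#   helm_owner clusterrole rayjob-editor-role
```

If the namespace printed differs from the one you are deploying into, the resource belongs to the earlier Fusion release, which matches the ownership errors shown above.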
Environment
Fusion 5.9.12 and above
Multi-namespace deployment on a shared cluster
Note: The kuberay-operator pod is introduced in Fusion 5.9.12 and is not present in earlier versions.
Please refer to this link for more information on model hosting with Ray.
Resolution
1. Review the purpose of the pod and determine whether it is necessary.
- If you have not deployed a model with Ray, disable the kuberay-operator pod by adding or modifying the following block in your fusion_values.yaml file:

      kuberay-operator:
        enabled: false

- If you want to enable the kuberay-operator pod, use the following block instead:

      kuberay-operator:
        image:
          imagePullPolicy: IfNotPresent
          imagePullSecrets: []
          repository: quay.io/kuberay/operator
        crd:
          create: false
        logstashEnabled: false
        singleNamespaceInstall: true

  If you are working on a local copy, save the modified values file, adjusting the block as needed for your environment.
- After updating the above block in your fusion_values.yaml file, run the upgrade script:

      ./gke_..._upgrade_fusion.sh

  Update the script as per your environment.
After the upgrade completes, the kuberay-operator pod will either appear in the namespace or remain absent, depending on the configuration you chose above.
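One way to check the outcome of the upgrade is to list the pods in the namespace and filter for the operator. This is a minimal sketch, assuming the pod name contains the string kuberay-operator; the function name find_kuberay_pods is hypothetical.

```shell
#!/bin/sh
# Report whether a kuberay-operator pod exists in the given namespace.
# find_kuberay_pods is an illustrative helper, not part of Fusion.
find_kuberay_pods() {
  ns="$1"
  # "-o name" prints one "pod/<name>" entry per line; grep keeps only the operator.
  pods=$(kubectl get pods -n "$ns" -o name | grep kuberay-operator || true)
  if [ -n "$pods" ]; then
    echo "$pods"
  else
    echo "no kuberay-operator pod in namespace $ns"
  fi
}

# Example call (replace the namespace with your own):
#   find_kuberay_pods <your-namespace>
```

With the operator disabled, the helper prints the "no kuberay-operator pod" message. With it enabled, it prints the pod's resource name.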
2. Correct KubeRay Operator RBAC permissions
If the kuberay-operator pod fails with the permission errors described above after deployment, update the ClusterRole and ClusterRoleBinding to grant the necessary permissions:
- Export the current YAML of the kuberay-operator ClusterRole and its ClusterRoleBinding so that you can edit it:

      kubectl get clusterrole kuberay-operator -o yaml > x.yaml
      kubectl get clusterrolebinding kuberay-operator -o yaml > y.yaml

- Ensure the ClusterRole (x.yaml) has rules similar to the following:

      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      metadata:
        name: kuberay-operator
      rules:
        - apiGroups: ["ray.io"]
          resources: ["rayclusters","rayjobs","rayservices"]
          verbs: ["get","list","watch","create","update","patch","delete"]
        - apiGroups: ["batch"]
          resources: ["jobs"]
          verbs: ["get","list","watch","create","update","patch","delete"]
        - apiGroups: [""]
          resources: ["pods", "services"]
          verbs: ["get","list","watch","create","update","patch","delete"]
        - apiGroups: ["networking.k8s.io"]
          resources: ["ingresses"]
          verbs: ["get","list","watch","create","update","patch","delete"]

- Update the ClusterRoleBinding (y.yaml) to reference the correct service account and namespace. If the binding already lists a subject for another Fusion namespace that still runs the operator, keep that entry and add yours alongside it, because subjects is a list and replacing the entry would remove the other namespace's access:

      apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRoleBinding
      metadata:
        name: kuberay-operator
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: kuberay-operator
      subjects:
        - kind: ServiceAccount
          name: kuberay-operator
          namespace: your-namespace

- Apply the changes:

      kubectl apply -f x.yaml
      kubectl apply -f y.yaml

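- Restart the kuberay-operator pod to pick up the new configuration. The pod name normally carries a generated suffix, so look up the full name first and then delete that pod:

      kubectl get pods -n <your-namespace> | grep kuberay-operator
      kubectl delete pod <kuberay-operator-pod-name> -n <your-namespace>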
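After the pod restarts, you can confirm that the permission from the original error is now granted by impersonating the operator's service account. This is a minimal sketch, assuming the service account is named kuberay-operator as in the error message above and that your own user is allowed to impersonate service accounts; the function name check_operator_rbac is hypothetical.

```shell
#!/bin/sh
# Ask the API server whether the operator's service account may list pods
# at cluster scope. check_operator_rbac is an illustrative helper.
check_operator_rbac() {
  ns="$1"
  kubectl auth can-i list pods --all-namespaces \
    --as="system:serviceaccount:${ns}:kuberay-operator"
}

# Example call (replace the namespace with your own):
#   check_operator_rbac <your-namespace>
```

The command prints "yes" when the ClusterRole and ClusterRoleBinding grant the permission and "no" otherwise. A "no" usually means the binding's subject does not match the namespace in which the operator runs.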