Troubleshooting - OCI Service Operator for Kubernetes
Identify the causes and fixes for problems with OCI Service Operator for Kubernetes
on Service Mesh.
Note
By default, operator-sdk installs OCI Service Operator for
Kubernetes bundle in the 'default' namespace. For most use cases, a namespace is
specified (for example, olm) when OCI Service Operator for
Kubernetes is installed. Therefore, the kubectl command might require
the namespace parameter: -n $NAMESPACE.
Install: Verify That the Operator Lifecycle Manager (OLM) Installation Was
Successful
Issue
Successful OLM installation needs verification.
Solution
To verify the successful OLM installation, run the status command:
Copy
## status of olm
$ operator-sdk olm status
INFO[0007] Fetching CRDs for version "0.20.0"
INFO[0007] Fetching resources for resolved version "v0.20.0"
INFO[0031] Successfully got OLM status for version "0.20.0"
NAME                                           NAMESPACE  KIND                      STATUS
operatorgroups.operators.coreos.com                       CustomResourceDefinition  Installed
operatorconditions.operators.coreos.com                   CustomResourceDefinition  Installed
olmconfigs.operators.coreos.com                           CustomResourceDefinition  Installed
installplans.operators.coreos.com                         CustomResourceDefinition  Installed
clusterserviceversions.operators.coreos.com               CustomResourceDefinition  Installed
olm-operator-binding-olm                                  ClusterRoleBinding        Installed
operatorhubio-catalog                          olm        CatalogSource             Installed
olm-operators                                  olm        OperatorGroup             Installed
aggregate-olm-view                                        ClusterRole               Installed
catalog-operator                               olm        Deployment                Installed
cluster                                                   OLMConfig                 Installed
operators.operators.coreos.com                            CustomResourceDefinition  Installed
olm-operator                                   olm        Deployment                Installed
subscriptions.operators.coreos.com                        CustomResourceDefinition  Installed
aggregate-olm-edit                                        ClusterRole               Installed
olm                                                       Namespace                 Installed
global-operators                               operators  OperatorGroup             Installed
operators                                                 Namespace                 Installed
packageserver                                  olm        ClusterServiceVersion     Installed
olm-operator-serviceaccount                    olm        ServiceAccount            Installed
catalogsources.operators.coreos.com                       CustomResourceDefinition  Installed
system:controller:operator-lifecycle-manager              ClusterRole               Installed
The output shows a list of installed components. Each entry in the STATUS column must be Installed. If any entry isn't listed as Installed, perform the following steps.
Uninstall OLM.
Copy
## Uninstall the OLM
$ operator-sdk olm uninstall
Reinstall OLM.
Copy
## Install the OLM
$ operator-sdk olm install --version 0.20.0
Install: OCI Service Operator for Kubernetes OLM Installation Fails with Error 🔗
Issue
After installing OLM, checking the installation status returns an error.
Solution
Steps to reproduce:
After installation, verify the successful OLM installation by running
the status command:
Copy
## status of olm
$ operator-sdk olm status
Error returned:
## FATA[0034] Failed to install OLM version "latest": detected existing OLM resources: OLM must be completely uninstalled before installation
To fix the error, uninstall OLM.
Copy
## Uninstall the OLM
$ operator-sdk olm uninstall
If the command fails, run the status command to get version information:
Copy
## status of olm
$ operator-sdk olm status
Next, try the following options:
Option 1: Uninstall OLM by specifying the version obtained from the status command, as shown in the following example.
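The version shown is an example; substitute the version reported by the status command.
Copy
## Uninstall OLM, pinning the version from the status output
$ operator-sdk olm uninstall --version 0.20.0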
Verify that the uninstall was successful with the following commands.
Verify that the olm namespace is removed.
Copy
$ kubectl get namespace olm
Error from server (NotFound): namespaces olm not found
Verify that OLM uninstalled successfully by ensuring that OLM owned
CustomResourceDefinitions are removed.
Copy
$ kubectl get crd | grep operators.coreos.com
Check that the OLM deployments are terminated.
Copy
$ kubectl get deploy -n olm
No resources found.
Install: OCI Service Operator for Kubernetes OLM Installation Fails with Timeout
Message 🔗
Issue
OCI Service Operator for Kubernetes OLM installation fails with the following
message.
## FATA[0125] Failed to run bundle upgrade: error waiting for CSV to install: timed out waiting for the condition
Explanation
The error signifies the installer timed out waiting for an installation
condition to complete. The error message might be misleading because, given
enough time, the installer eventually reports
Succeeded.
Solution
To ensure that OCI Service Operator for Kubernetes OLM installation
succeeded, follow these steps.
Verify the status of the CSV.
Copy
$ kubectl get csv
NAME DISPLAY VERSION REPLACES PHASE
oci-service-operator.vX.X.X oci-service-operator X.X.X Succeeded
If the phase does not reach Succeeded, delete the
bundle pod of the OCI Service Operator for Kubernetes version you are
deploying.
Copy
$ kubectl get pods | grep oci-service-operator-bundle
$ kubectl delete pod <POD_FROM_ABOVE_COMMAND>
After the bundle pod is deleted, reinstall the OSOK bundle.
Copy
$ operator-sdk run bundle iad.ocir.io/oracle/oci-service-operator-bundle:X.X.X -n oci-service-operator-system --timeout 5m
## or for Upgrade
$ operator-sdk run bundle-upgrade iad.ocir.io/oracle/oci-service-operator-bundle:X.X.X -n oci-service-operator-system --timeout 5m
Users must be logged into the Oracle Registry at iad.ocir.io in Docker to run the command. To ensure you are logged in, see Pulling Images Using the Docker CLI.
Verify that the OCI Service Operator for Kubernetes is deployed
successfully.
Copy
$ kubectl get deployments | grep "oci-service-operator-controller-manager"
NAME READY UP-TO-DATE AVAILABLE AGE
oci-service-operator-controller-manager 1/1 1 1 2d9h
Install: OSOK Operator Installation Fails with Run Bundle 🔗
Issue
OSOK Operator installation fails with the following message:
FATA[0121] Failed to run bundle: install plan is not available for the subscription
Debug
The error signifies the installation failed to run the operator bundle. The
error is caused by one of the following issues: failing to pull the image, a
network issue, or an upstream OPM issue.
List the pods in the operator installation namespace.
Copy
kubectl get pods -n oci-service-operator-system
View the logs of the failing pod in the operator namespace.
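For example, assuming the operator was installed in the oci-service-operator-system namespace used in this topic:
Copy
## Stream the logs of the failing pod identified by the previous command
$ kubectl logs pod/<POD_FROM_ABOVE_COMMAND> -n oci-service-operator-system -f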
Install the OCI Service Operator for Kubernetes Operator in the Kubernetes cluster in your namespace (oci-service-operator-system) using the following command.
Copy
operator-sdk run bundle --index-image quay.io/operator-framework/opm:v1.23.1 iad.ocir.io/oracle/oci-service-operator-bundle:X.X.X -n oci-service-operator-system --timeout 5m
Users must be logged into the Oracle Registry at iad.ocir.io in Docker to run the command. To ensure you're logged in, see Pulling Images Using the Docker CLI.
The command produces output similar to the following:
INFO[0036] Successfully created registry pod: iad-ocir-io-oracle-oci-service-operator-bundle-X-X-X
INFO[0036] Created CatalogSource: oci-service-operator-catalog
INFO[0037] OperatorGroup "operator-sdk-og" created
INFO[0037] Created Subscription: oci-service-operator-vX-X-X-sub
INFO[0040] Approved InstallPlan install-tzk5f for the Subscription: oci-service-operator-vX-X-X-sub
INFO[0040] Waiting for ClusterServiceVersion "oci-service-operator-system/oci-service-operator.vX.X.X" to reach 'Succeeded' phase
INFO[0040] Waiting for ClusterServiceVersion "oci-service-operator-system/oci-service-operator.vX.X.X" to appear
INFO[0048] Found ClusterServiceVersion "oci-service-operator-system/oci-service-operator.vX.X.X" phase: Pending
INFO[0049] Found ClusterServiceVersion "oci-service-operator-system/oci-service-operator.vX.X.X" phase: InstallReady
INFO[0053] Found ClusterServiceVersion "oci-service-operator-system/oci-service-operator.vX.X.X" phase: Installing
INFO[0066] Found ClusterServiceVersion "oci-service-operator-system/oci-service-operator.vX.X.X" phase: Succeeded
INFO[0067] OLM has successfully installed "oci-service-operator.vX.X.X"
Install: OSOK Operator Installation Fails with Authorization Error 🔗
Issue
OSOK Operator installation fails with the following message:
Error: failed to authorize: failed to fetch oauth token: unexpected status: 401 Unauthorized
Sample install command:
Copy
$ operator-sdk run bundle iad.ocir.io/oracle/oci-service-operator-bundle:X.Y.Z -n oci-service-operator-system --timeout 5m
Here's the complete error message.
INFO[0002] trying next host error="failed to authorize: failed to fetch
oauth token: unexpected status: 401 Unauthorized" host=iad.ocir.io
FATA[0002] Failed to run bundle: pull bundle image: error pulling image
iad.ocir.io/oracle/oci-service-operator-bundle:X.X.X: error resolving
name : failed to authorize: failed to fetch oauth token: unexpected
status: 401 Unauthorized
Explanation
This error occurs because Docker isn't logged into the Oracle Registry (iad.ocir.io).
Solution
To resolve the issue, refresh the token used for pulling images from OCIR by
logging in to the Oracle Registry again.
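A minimal sketch of the login, assuming the standard OCIR login flow where the username is your tenancy namespace and user, and the password is an auth token:
Copy
## Log in to the Oracle Registry; supply an auth token when prompted for a password
$ docker login iad.ocir.io -u <tenancy-namespace>/<username>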
Upgrade: OSOK Operator Upgrade Fails with Run Bundle 🔗
Issue
OSOK Operator upgrade fails with the following message.
FATA[0307] Failed to run bundle upgrade: error waiting for CSV to install
Explanation
The error signifies the upgrade of the operator bundle failed. The error has
one of the following causes: failing to pull the image, a network issue, or
an upstream OPM issue.
Solution
To make OCI Service Operator for Kubernetes upgrade successful, follow these
steps.
Users must be logged into the Oracle Registry at iad.ocir.io in Docker to run the command. To ensure you are logged in, see Pulling Images Using the Docker CLI.
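Then rerun the bundle upgrade. The following command mirrors the bundle-upgrade command shown earlier in this topic; the tag X.X.X is a placeholder for the version you're upgrading to.
Copy
## Upgrade the OSOK bundle
$ operator-sdk run bundle-upgrade iad.ocir.io/oracle/oci-service-operator-bundle:X.X.X -n oci-service-operator-system --timeout 5m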
Cluster: Verify that OCI Service Operator for Kubernetes Pods are Running 🔗
Check pods
Verify that the OSOK pods are running successfully.
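A command like the following lists the OSOK pods; the namespace assumes the default installation namespace used elsewhere in this topic.
Copy
## List the OSOK pods
$ kubectl get pods -n oci-service-operator-system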
If the pods aren't running, check the pod logs for the specific issue using the following
command.
Copy
$ kubectl logs pod/<POD_FROM_ABOVE_COMMAND> -f
Cluster: Authorization Failed or Requested Resource Not Found 🔗
Issue
Received an authorization failed or requested resource not found error when deploying the
service.
Example
"message": Failed to create or update resource: Service error:NotAuthorizedOrNotFound. Authorization failed or requested resource not found.. http status code: 404.
Solution
The error occurs because of user authorization. To resolve the issue, review
the status conditions on the custom resource.
The following sample output for a virtual service route table
contains one Unknown condition. For a
successful custom resource creation, all the statuses are
True.
Copy
{
  "conditions": [
    {
      "lastTransitionTime": "2022-01-07T08:35:43Z",
      "message": "Dependencies resolved successfully",
      "observedGeneration": 2,
      "reason": "Successful",
      "status": "True",
      "type": "ServiceMeshDependenciesActive"
    },
    {
      "lastTransitionTime": "2022-01-09T05:15:30Z",
      "message": "Invalid RouteRules, route rules target weight sum should be 100 for the resource!",
      "observedGeneration": 2,
      "reason": "BadRequest",
      "status": "Unknown",
      "type": "ServiceMeshConfigured"
    },
    {
      "lastTransitionTime": "2022-01-07T08:36:03Z",
      "message": "Resource in the control plane is Active, successfully reconciled",
      "observedGeneration": 1,
      "reason": "Successful",
      "status": "True",
      "type": "ServiceMeshActive"
    }
  ],
  "virtualDeploymentIdForRules": [
    [
      "ocid1.meshvirtualdeployment.oc1.iad.amaaaaaamgzdkjyak7tam2jdjutbend4h7surdj4t5yv55qukvx43547kbnq"
    ]
  ],
  "virtualServiceId": "ocid1.meshvirtualservice.oc1.iad.amaaaaaamgzdkjyagkax7xtz7onqlb65ec32dtu67erkmka2x6tlst4xigsa",
  "virtualServiceName": "pet-details",
  "virtualServiceRouteTableId": "ocid1.meshvirtualserviceroutetable.oc1.iad.amaaaaaamgzdkjyavcphbc5iobjqa6un7bqq2ki2xyfpvzfd6jzmdsv4l6va"
}
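Status output like the preceding sample can be retrieved with kubectl. The resource name and namespace below are placeholders, and the plural resource name is an assumption based on the VirtualServiceRouteTable kind:
Copy
## Inspect the status of a virtual service route table custom resource
$ kubectl get virtualserviceroutetables <NAME> -n <NAMESPACE> -o jsonpath='{.status}'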
Cluster: Custom Resource Operations Fail with Webhook Errors 🔗
Issue
Custom resource operations fail with webhook errors.
Copy
Error from server (InternalError): error when creating "service_mesh/mesh_create.yaml": Internal error occurred: failed calling webhook "mesh-validator.servicemesh.oci.oracle.cloud.com": failed to call webhook: Post "https://oci-service-operator-controller-manager-service.oci-service-operator-system.svc:443/validate-servicemesh-oci-oracle-com-v1beta1-mesh?timeout=10s": service "oci-service-operator-controller-manager-service" not found
Explanation
An improper installation causes this error and hence the operator has issues
calling the webhooks.
Solution
To make OCI Service Operator for Kubernetes installation successful, follow
these steps.
Clean up the existing operator installation in the corresponding
namespace.
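A sketch of the cleanup, assuming the operator was installed with operator-sdk run bundle and that the package name and -n flag follow the conventions used elsewhere in this topic:
Copy
## Remove the existing bundle installation before reinstalling
$ operator-sdk cleanup oci-service-operator -n oci-service-operator-system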
Install the OCI Service Operator for Kubernetes Operator in the Kubernetes cluster in your namespace (oci-service-operator-system) using the following command.
Copy
operator-sdk run bundle --index-image quay.io/operator-framework/opm:v1.23.1 iad.ocir.io/oracle/oci-service-operator-bundle:X.X.X -n oci-service-operator-system --timeout 5m
Users must be logged into the Oracle Registry at iad.ocir.io in Docker to run the command. To ensure you're logged in, see Pulling Images Using the Docker CLI.
The command produces output similar to the following:
INFO[0036] Successfully created registry pod: iad-ocir-io-oracle-oci-service-operator-bundle-X-Y-Z
INFO[0036] Created CatalogSource: oci-service-operator-catalog
INFO[0037] OperatorGroup "operator-sdk-og" created
INFO[0037] Created Subscription: oci-service-operator-vX-Y-Z-sub
INFO[0040] Approved InstallPlan install-tzk5f for the Subscription: oci-service-operator-vX-Y-Z-sub
INFO[0040] Waiting for ClusterServiceVersion "oci-service-operator-system/oci-service-operator.vX.Y.Z" to reach 'Succeeded' phase
INFO[0040] Waiting for ClusterServiceVersion "oci-service-operator-system/oci-service-operator.vX.Y.Z" to appear
INFO[0048] Found ClusterServiceVersion "oci-service-operator-system/oci-service-operator.vX.Y.Z" phase: Pending
INFO[0049] Found ClusterServiceVersion "oci-service-operator-system/oci-service-operator.vX.Y.Z" phase: InstallReady
INFO[0053] Found ClusterServiceVersion "oci-service-operator-system/oci-service-operator.vX.Y.Z" phase: Installing
INFO[0066] Found ClusterServiceVersion "oci-service-operator-system/oci-service-operator.vX.Y.Z" phase: Succeeded
INFO[0067] OLM has successfully installed "oci-service-operator.vX.Y.Z"
Verify the successful installation of webhooks corresponding to all the
Service Mesh resources with kubectl.
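For example, the following commands list the registered webhook configurations; look for the Service Mesh entries, such as the mesh-validator webhook named in the preceding error.
Copy
## List the validating and mutating webhook configurations
$ kubectl get validatingwebhookconfigurations
$ kubectl get mutatingwebhookconfigurations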
Cluster: Operator / Application Pod Fails with "Error: runAsNonRoot and image will run as
root" 🔗
Issue
The operator or application pod fails with the following error.
Error: runAsNonRoot and image will run as root
Solution
This error occurs when PodSecurityPolicies are enforced on your cluster and privileged access isn't given to the operator and application pods. To resolve the issue, provide privileged access to all pods in the namespaces where the operator and the application pods are running.
To run the operator with pod security policies enabled, apply a configuration like the following:
Copy
# pod security policy that allows privileged access, all volumes, and running as any user
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: operator-psp
spec:
  privileged: true
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes:
  - '*'
---
# Cluster role which grants access to use the pod security policy
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: operator-psp-crole
rules:
- apiGroups:
  - policy
  resourceNames:
  - operator-psp
  resources:
  - podsecuritypolicies
  verbs:
  - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: operator-psp-crole-binding
roleRef:
  kind: ClusterRole
  name: operator-psp-crole
  apiGroup: rbac.authorization.k8s.io
subjects:
# Authorize all service accounts in the operator and application namespaces
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts:<operator's namespace>
- kind: Group
  apiGroup: rbac.authorization.k8s.io
  name: system:serviceaccounts:<application namespace>
Cluster: VDB Is Active but Pod Doesn't Contain Sidecar Proxies 🔗
Issue
The virtual deployment binding (VDB) is Active, but the pod doesn't contain sidecar proxies.
Solution
To resolve the issue, perform the following checks.
Check whether sidecar injection is enabled at the namespace level or at
the pod level. If not, enable it by following the steps here: Sidecar Injection on Pods.
Check whether SIDECAR_IMAGE is present in the
oci-service-operator-servicemesh-config map (a sample command follows this list):
If SIDECAR_IMAGE isn't present, then check whether an OCI policy exists for MESH_PROXY_DETAILS_READ for the dynamic group that was used to create the service mesh resources. See: Policies when Managing Service Mesh with kubectl.
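A sketch of the check, assuming the config map name given above and the default operator namespace:
Copy
## Look for SIDECAR_IMAGE in the Service Mesh config map
$ kubectl get configmap oci-service-operator-servicemesh-config -n oci-service-operator-system -o yaml | grep SIDECAR_IMAGE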
Cluster: Pods Restarting Continuously 🔗
Issue
The pods in your cluster are restarting continuously.
Solution
Check the following setting to resolve the issue.
Check whether SIDECAR_IMAGE is present in the oci-service-operator-servicemesh-config map using the following command. If the sidecar isn't present, restart the pods.
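The config map name and namespace below are assumptions based on the default installation described in this topic.
Copy
## Look for SIDECAR_IMAGE in the Service Mesh config map
$ kubectl get configmap oci-service-operator-servicemesh-config -n oci-service-operator-system -o yaml | grep SIDECAR_IMAGE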
Cluster: OCI Service Operator for Kubernetes Doesn't Uninstall Because of Finalizers 🔗
Issue
When uninstalling, CRDs aren't deleted because of finalizer clean-up. The system returns the following error message.
FATA[0123] Cleanup operator: wait for customresourcedefinition deleted: Get "https://x.x.x.x:6443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/ingressgateways.servicemesh.oci.oracle.com": context deadline exceeded
Note
The example uses ingress gateways, but similar issues can happen with any mesh
resource.
Solution
To resolve this issue, perform the following steps.
Fetch all the CRDs that are yet to be deleted (a sample command follows the note).
Note
The operator SDK performs deletion in alphabetical order. Because the preceding error occurred at ingressgateways, you see all remaining resources, including non-Service Mesh resources that follow in alphabetical order.
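A command like the following lists the remaining operator CRDs; the grep pattern is an assumption based on the API group in the error message, so adjust it to match your installation.
Copy
## List the CRDs created by the operator
$ kubectl get crd | grep oci.oracle.com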
Delete all the objects present in the preceding CRDs.
Tip
Try deleting child resources first and then proceed to
the parent resources. Otherwise, the deletion step gets
stuck.
Copy
## Delete all the objects for that custom resource
$ kubectl delete CUSTOM_RESOURCE --all --all-namespaces
## Once deleted, verify there are no more objects present for that custom resource
$ kubectl get CUSTOM_RESOURCE --all-namespaces
Delete all the remaining CRDs using the following command. CRDs deleted
before might produce not-found messages.
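A sketch of the delete command, assuming the operator CRD names contain oci.oracle.com as in the preceding error:
Copy
## Delete the remaining operator CRDs
$ kubectl get crd -o name | grep oci.oracle.com | xargs kubectl delete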
If OCI Service Operator for Kubernetes is installed in the default
namespace, then run the following command to get rid of other resources
created during installation.
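A sketch of the cleanup, assuming the operator was installed with operator-sdk run bundle and the package name matches the CSV name shown earlier in this topic:
Copy
## Clean up the operator resources in the default namespace
$ operator-sdk cleanup oci-service-operator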
If OCI Service Operator for Kubernetes is installed using -n
NAMESPACE, then delete the namespace to get rid of other
resources created during installation.
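For example, using the same $NAMESPACE value that was passed during installation:
Copy
## Delete the installation namespace and everything in it
$ kubectl delete namespace $NAMESPACE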