Installing the Couchbase Operator using Helm fails readiness probes

Hello Couchbase,

I am using the Helm chart v2.3.001 (the latest as of now), which contains the Couchbase Operator v2.3.0.

I created a fresh k8s cluster v1.23.6 (deployed via kind or Kubespray, with Calico as the CNI) and installed the chart with the following command: helm install default couchbase/couchbase-operator.
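In full, the install steps were roughly the following (the repo add step assumes the standard couchbase-partners chart repository; adjust if yours is named differently):

    helm repo add couchbase https://couchbase-partners.github.io/helm-charts
    helm repo update
    helm install default couchbase/couchbase-operator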

kubectl get pods:

jupiter-0000 0/1 Running 0 8m
jupiter-couchbase-admission-controller-5d5ff4d897-ldk5b 1/1 Running 0 25h
jupiter-couchbase-operator-7bf5b8556-98khq 1/1 Running 0 25h

kubectl describe pods jupiter-0000:

Events:
Type Reason Age From Message
Normal Scheduled 3m39s default-scheduler Successfully assigned default/jupiter-0000 to node2
Normal Pulled 3m39s kubelet Container image "couchbase/server:7.0.2" already present on machine
Normal Created 3m39s kubelet Created container couchbase-server-init
Normal Started 3m39s kubelet Started container couchbase-server-init
Normal Pulled 3m39s kubelet Container image "couchbase/server:7.0.2" already present on machine
Normal Created 3m39s kubelet Created container couchbase-server
Normal Started 3m39s kubelet Started container couchbase-server
Warning Unhealthy 5s (x13 over 3m20s) kubelet Readiness probe failed: dial tcp 10.233.96.212:8091: connect: connection refused

kubectl logs of the operator:
{"level":"info","ts":1652886554.1603956,"logger":"cluster","msg":"Pod deleted","cluster":"default/jupiter","name":"jupiter-0000"}
{"level":"info","ts":1652886554.169,"logger":"cluster","msg":"Reconciliation failed","cluster":"default/jupiter","error":"fail to create member's pod (jupiter-0000): dial tcp 10.233.90.54:8091: connect: connection refused","stack":"github.com/couchbase/couchbase-operator/pkg/util/netutil.WaitForHostPort\n\tgithub.com/couchbase/couchbase-operator/pkg/util/netutil/netutil.go:37\ngithub.com/couchbase/couchbase-operator/pkg/util/k8sutil.WaitForPod\n\tgithub.com/couchbase/couchbase-operator/pkg/util/k8sutil/k8sutil.go:289\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).waitForCreatePod\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/pod.go:108\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).createPod\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/pod.go:41\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).createMember\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/member.go:168\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).createInitialMember\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/member.go:310\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).create\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:325\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/reconcile.go:148\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).runReconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:481\ngithub.com/couchbase/couchbase-operator/pkg/cluster.(*Cluster).Update\n\tgithub.com/couchbase/couchbase-operator/pkg/cluster/cluster.go:524\ngithub.com/couchbase/couchbase-operator/pkg/controller.(*CouchbaseClusterReconciler).Reconcile\n\tgithub.com/couchbase/couchbase-operator/pkg/controller/controller.go:90\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\tsigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.11.0/pkg/internal/controller/controller.go:227"}

kubectl logs for the server:

Starting Couchbase Server -- Web UI available at http://:8091
and logs available in /opt/couchbase/var/lib/couchbase/logs
chown: changing ownership of 'var/lib/couchbase': Operation not permitted

Can someone please help me figure out the root cause of this issue?

I have tried getting inside the couchbase operator pod to run cbopinfo, but I was not able to: the Docker image seems to be stripped of all binaries such as sh or bash. I would love to know the steps needed to run cbopinfo against a k8s cluster so that I can provide you with more logs.

Cheers.

Hi!

Could you share the CRDs for the Couchbase cluster that the operator tried to apply? Are you using persistent volume claims?

Thank you!

Hello Dmitrii,

Here are the CRDs that I applied before installing the operator:

Yes, I am using PVCs with storage classes:

        services:
        - data
        - index
        - query
        - search
        size: 1
        volumeMounts:
          default: couchbase
          data: couchbase
          index: couchbase
    volumeClaimTemplates:
      - metadata:
          name: couchbase
        spec:
          storageClassName: default
          resources:
            requests:
              storage: 50Gi

My default storage class is the local-path provisioner (GitHub: rancher/local-path-provisioner, which dynamically provisions persistent local storage with Kubernetes).
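For reference, the storage class and the resulting claims can be inspected with plain kubectl, e.g.:

    kubectl get storageclass
    kubectl get pvc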

Also, in case it helps, I am using the security context provided by default in the Helm chart:

 securityContext:
    fsGroup: 1000
    # -- Indicates that the container must run as a non-root user. If true, the
    # Kubelet will validate the image at runtime to ensure that it does not run
    # as UID 0 (root) and fail to start the container if it does. If unset or
    # false, no such validation will be performed. May also be set in
    # SecurityContext.  If set in both SecurityContext and PodSecurityContext,
    # the value specified in SecurityContext takes precedence.
    runAsNonRoot: true
    runAsUser: 1000
    sysctls: []
    # -- The Windows specific settings applied to all containers. If
    # unspecified, the options within a container's SecurityContext will be
    # used. If set in both SecurityContext and PodSecurityContext, the value
    # specified in SecurityContext takes precedence. Note that this field cannot
    # be set when spec.os.name is linux.
    windowsOptions: {}

Could you try setting couchbaseclusters.spec.securityContext.fsGroup to 1000 as described here? You may also need to set runAsUser to 1000, according to this.
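As a rough sketch, the relevant fragment of the CouchbaseCluster resource would look something like this (the values here are just an example; adjust them to whatever UID/GID your storage provisioner expects):

    apiVersion: couchbase.com/v2
    kind: CouchbaseCluster
    metadata:
      name: jupiter
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000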

Thank you

This is exactly what I am doing; it is already set by default as you described. Thanks.

Most likely this behavior is caused by the configuration of the Calico CNI networking stack, so I suggest trying the default kindnet to see whether it resolves the readiness error.

The logs of the Calico pods may also uncover some interesting information for debugging. I've personally spun up a kind cluster and done a successful install, even with the permission-denied error.
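If you want to poke at Calico directly, something along these lines usually surfaces useful information (the label selector and resource names are the common Calico defaults and may differ slightly in a Kubespray deployment):

    kubectl -n kube-system logs -l k8s-app=calico-node --tail=100
    kubectl get ippools.crd.projectcalico.org -o yaml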

Thanks @tommie

I just installed it in kind and it works fine. Kubespray does not support kindnet; I tried different CNI plugins and got even more errors, with many pods crashing. :confused:

I uploaded the cao logs collected by running ./cao collect-logs --log-level=1:

https://gofile.io/d/hhXjvF

I run my k8s cluster on bare metal with L2 connectivity, and I kept the default Calico values set by Kubespray (inventory/sample/group_vars/k8s_cluster/k8s-net-calico.yml in kubernetes-sigs/kubespray on GitHub).

I can’t see any errors in the calico pods.

Cheers.

Thanks for the logs! From an Operator perspective all I can tell is that there is a failure to communicate with the Pods over this network.

message: 'fail to create member''s pod (default-0000): dial tcp 10.233.96.4:8091: connect: connection refused'

At this point it's probably best to drop a line to someone supporting Calico to help resolve the issue, as I'm not sure what the root cause is here.
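One quick sanity check in the meantime is to test raw TCP connectivity to the pod from elsewhere in the cluster; a rough sketch (the IP is the one from the message above, and any image with a basic HTTP client will do):

    kubectl run nettest --rm -it --image=busybox --restart=Never -- \
      wget -O- -T 2 http://10.233.96.4:8091

If the same check succeeds from inside the couchbase-server pod but fails from other pods or nodes, that would point at the CNI/routing layer rather than at Couchbase itself.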