Helm Chart - Cluster Setup Failed

tcoates · June 24, 2019, 6:31pm

Hello,

I’m using both the couchbase-operator and couchbase-cluster charts as subcharts in my application. This means that couchbasecluster object is already present when the operator pod starts up. On some installs, the cluster setup fails with this message in the operator logs:

time=“2019-06-24T18:16:33Z” level=info msg=“couchbase-operator v1.2.0 (release)” module=main
time=“2019-06-24T18:16:33Z” level=info msg=“Obtaining resource lock” module=main
time=“2019-06-24T18:16:33Z” level=info msg=“Starting event recorder” module=main
time=“2019-06-24T18:16:33Z” level=info msg=“Attempting to be elected the couchbase-operator leader” module=main
time=“2019-06-24T18:16:33Z” level=info msg=“Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"txt", Name:"couchbase-operator", UID:"326a87a4-96ac-11e9-890f-eabc2b90ead3", APIVersion:"v1", ResourceVersion:"962566", FieldPath:""}): type: ‘Normal’ reason: ‘LeaderElection’ txt-couchbase-operator-5969b769bb-cn88q became leader” module=event_recorder
time=“2019-06-24T18:16:33Z” level=info msg=“I’m the leader, attempt to start the operator” module=main
time=“2019-06-24T18:16:33Z” level=info msg=“Creating the couchbase-operator controller” module=main
time=“2019-06-24T18:16:33Z” level=info msg=“CRD initialized, listening for events…” module=controller
time=“2019-06-24T18:16:33Z” level=info msg=“Watching new cluster” cluster-name=txt-couchbase-cluster module=cluster
time=“2019-06-24T18:16:33Z” level=info msg=“Janitor process starting” cluster-name=txt-couchbase-cluster module=cluster
time=“2019-06-24T18:16:33Z” level=info msg=“Setting up client for operator communication with the cluster” cluster-name=txt-couchbase-cluster module=cluster
time=“2019-06-24T18:16:33Z” level=info msg=“Cluster does not exist so the operator is attempting to create it” cluster-name=txt-couchbase-cluster module=cluster
time=“2019-06-24T18:17:04Z” level=error msg=“Cluster setup failed: cluster create: failed to update cluster phase (Creating): Timeout: request did not complete within requested timeout 30s” cluster-name=txt-couchbase-cluster module=cluster
time=“2019-06-24T18:17:04Z” level=warning msg=“Fail to handle event: ignore failed cluster (txt-couchbase-cluster). Please delete its CR”

Other times, the cluster is created successfully. It seems that there is a race condition between the operator and the admission controller pods. In cases where the admission controller pod starts before the operator pod, the cluster is created successfully. In cases where the operator pod starts first, I always see the above error. Once I see that error I can manually remove the couchbasecluster and recreate and the cluster creates successfully. This is obviously not a good solution for production.

Should there be some sort of retry performed if the cluster fails due to what I assume is an issue communicating with the admission controller?

tommie · June 24, 2019, 11:19pm

Hi Tom,

The admission controller mainly validates the cluster, although this also means that the admission controller will try to inject default values that may be missing from the cluster spec. We’ve tried to ensure that the default cluster values wouldn’t be dependent on having the admission controller running, which means the cluster should still start if you set :
deployments.admissionController=false.
However, if the cluster never starts without the admission controller (even with retrying), then I suspect the issue is some value isn’t being injected in time due to a race at startup. And if that’s the case I’ll need to see your chart values+overrides to look further.

What also seems telling is this timeout:
Timeout: request did not complete within requested timeout 30
Which can occur in cluster using many Persistent volumes or slow network pulling images. Try increasing pod-create timeout by adding the following to values file:

couchbaseOperator:
   # name of the couchbase operator
  name: "couchbase-operator"
....
   commandArgs:
     # Set to false if you are installing the CRD manually
     create-crd: true
     pod-create-timeout: 120

github.com

couchbase-partners/helm-charts/blob/b27f5a58175a1915543c59d8dbe760f8315edb13/couchbase-operator/values.yaml#L36


# name of the couchbase operator
name: "couchbase-operator"
# image config
image:
  repository: couchbase/operator
  tag: 1.2.0-DP
imagePullPolicy: IfNotPresent
# imagePullSecrets is an optional list of references to secrets  to use for pulling images
imagePullSecrets: []
# additional command arguments will be translated to `--key=value`
commandArgs:
  # Set to false if you are installing the CRD manually
  create-crd: true
# readinessProbe marks pod as ready when readiness port (8080) responds
readinessProbe:
  # initialDelaySeconds is a delay before readiness probe is initiated
  initialDelaySeconds: 3
  # periodSeconds decides how often to perform the probe
  periodSeconds: 3
  # failureThreshold sets Pod status to failure when threshold is reached
  failureThreshold: 19

tcoates · June 25, 2019, 11:44pm

Thanks. I already had the pod-create-timeout setting at 120 seconds as that was my first thought on what the problem might be. As you can see from the logs, the timeout still occurs at 30 seconds which leads me to believe that the issue is not with the pod creation, but some part of the process further upstream.

I disabled the admission-controller on your recommendation and that does seem to solve the immediate problem. The cluster is consistently created now. However, this does seem like a poor work around as having the additional validation of the admission-controller would be a good thing. If this is something that is being tracked in a bug I would like to follow it so that I can restore that functionality when the issue is resolved.

Thanks again for your help.

tommie · June 25, 2019, 11:58pm

Ok thanks Tom,
Passing with admission controller disabled is actually the opposite of what I expected. From the original post it sounded like Pod creation fails unless the admission controller starts first.

The workaround was just to help understand the problem. We don’t have a ticket for this yet, but if you can share your charts I can try and reproduce the issue and see what’s happening here.
Also, I’m assuming everything works non-helm to exclude anything from operator itself.

tcoates · June 26, 2019, 12:19am

Yes - when I deploy without helm, things work just fine. It seems to be the fact that when the operator starts up, there is already a couchbasecluster defined and if the admission-controller isn’t there, things fail. Specifically, it seems like this is the key error message:

Cluster setup failed: cluster create: failed to update cluster phase (Creating)

My chart is using our own custom containers for the application side so I can’t send it over as-is. Let me put together a more simplistic example and post back here.

tcoates · June 26, 2019, 4:04pm

I’ve attached a super simple example chart. I it installed with this command:

helm install -n cbtest --namespace cbtest cbtest-0.1.0.tgz

It failed to create the cluster with the same error as I posted earlier on the first attempt but succeeded on the second.

cbtest-0.1.0.tgz.zip (8.8 KB)

tommie · June 26, 2019, 10:50pm

Thanks Tom,

Looks like the pod-timeout indentation is off a few spaces and the timeout wasn’t being set in your example here.
Running as is, I did get a similar timeout on my setup, but fixed it by correcting the indentation in your file. Everything created in one shot. Here’s the snippet I fixed, and full value file attached,

couchbase-operator:
  deployments:
    admissionController: true
  rbac:
    clusterRoleAccess: false
  couchbaseOperator:
    commandArgs:
      create-crd: false
      pod-create-timeout: 120s

I also shifted couchbaseOperator to the left, previously it was nested under rbac.

Hopefully this works in your setup as well
values.yaml.zip (692 Bytes)

tcoates · June 27, 2019, 12:46am

Whoops - that’s a mistake on my part. However, I updated my test chart with your values.yaml and I’m still seeing the same problem:

time=“2019-06-27T00:41:38Z” level=info msg=“couchbase-operator v1.2.0 (release)” module=main
time=“2019-06-27T00:41:38Z” level=info msg=“Obtaining resource lock” module=main
time=“2019-06-27T00:41:38Z” level=info msg=“Starting event recorder” module=main
time=“2019-06-27T00:41:38Z” level=info msg=“Attempting to be elected the couchbase-operator leader” module=main
time=“2019-06-27T00:41:38Z” level=info msg=“I’m the leader, attempt to start the operator” module=main
time=“2019-06-27T00:41:38Z” level=info msg=“Creating the couchbase-operator controller” module=main
time=“2019-06-27T00:41:38Z” level=info msg=“Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"cbtest", Name:"couchbase-operator", UID:"52caf658-9874-11e9-8d44-4e5271016bf8", APIVersion:"v1", ResourceVersion:"1456953", FieldPath:""}): type: ‘Normal’ reason: ‘LeaderElection’ cbtest-couchbase-operator-76f7996cbb-srj58 became leader” module=event_recorder
time=“2019-06-27T00:41:38Z” level=info msg=“Watching new cluster” cluster-name=cbtest-couchbase-cluster module=cluster
time=“2019-06-27T00:41:38Z” level=info msg=“Janitor process starting” cluster-name=cbtest-couchbase-cluster module=cluster
time=“2019-06-27T00:41:38Z” level=info msg=“Setting up client for operator communication with the cluster” cluster-name=cbtest-couchbase-cluster module=cluster
time=“2019-06-27T00:41:38Z” level=info msg=“Cluster does not exist so the operator is attempting to create it” cluster-name=cbtest-couchbase-cluster module=cluster
time=“2019-06-27T00:42:08Z” level=error msg=“Cluster setup failed: cluster create: failed to update cluster phase (Creating): Timeout: request did not complete within requested timeout 30s” cluster-name=cbtest-couchbase-cluster module=cluster
time=“2019-06-27T00:42:08Z” level=warning msg=“Fail to handle event: ignore failed cluster (cbtest-couchbase-cluster). Please delete its CR”

On the second attempt with no changes everything worked.

tommie · June 27, 2019, 4:29pm

Ok, thanks for giving it a try.
Which version of kubernetes is this? And also which platform are you running on? ie… AKS, AWS, minikube?

After verifying the chart you sent works - at least in AWS, I’m wondering if something environment related is happening. Specifically because the cause of error is due to the operator attempting to update the cluster status to ‘Creating’ which is a trivial task being sent over k8s rest api that shouldn’t timeout. This is pre-Pod creation.
Also, assuming just 1 operator is running in 1 namespace at a time?

Only thing I can think of as a fix on our end is to apply the timeout to this section of code that updates status. Because, if you have it set to 120s then it shouldn’t fail ‘within requested timeout 30s’. Currently timeout only applies to pod creation process.

tcoates · June 27, 2019, 5:25pm

No problem. I’m running this in AKS, and yes, just one operator per namespace. Right now, as a work around, I’m just running without the admission-controller which seems to work every time. Thanks for your help tracking this down.

tommie · June 27, 2019, 7:54pm

Sure, and what is the kubernetes version you’re using in AKS… 1.12, 1.13?
We’ll see what we can find, and If this becomes really problematic to workaround please also provide output from https://docs.couchbase.com/operator/1.2/cbopinfo.html and I’ll take another look.
thanks!

tcoates · July 3, 2019, 4:15pm

Sorry for the delay - I was out of the office for a few days. All my testing has been done on Kubernetes 1.13.5.

I captured the cbopinfo output from both a good and bad run. I’m not totally sure if there is any sensitive information included, so there is there some way I can send those to you privately?

Thanks,
-Tom

Topic		Replies	Views
Failed calling webhook admission controller during cluster creation Kubernetes	2	1732	April 26, 2022
Installing couchbase operator using helm fails readiness probes Kubernetes	7	1353	June 7, 2022
Operator error: "Reconciler error" "unexpected cluster phase: Failed" Kubernetes	1	1427	July 28, 2021
Couchbase Operator Kubernetes	10	2946	April 24, 2020
couchbase operator one cluster couchbase pod in completed state Kubernetes	1	1266	July 9, 2019

Helm Chart - Cluster Setup Failed

Related topics