I was trying to set up the Couchbase Operator on IBM Cloud Kubernetes, but ran into an issue while adding persistent storage to the cluster. After running the cbopctl create command, the Couchbase services are created but the Couchbase pods are not. The persistent volume stays in the Pending state and then gets deleted on its own. Here’s the error in the operator logs:
time="2018-11-23T09:54:39Z" level=info msg="deleted pod (cb-example-0000)" cluster-name=cb-example module=cluster
time="2018-11-23T09:54:39Z" level=error msg="Cluster setup failed: fail to create member's pod (cb-example-0000): failed to create persistent volume claim: context deadline exceeded" cluster-name=cb-example module=cluster
time="2018-11-23T09:54:39Z" level=warning msg="Fail to handle event: ignore failed cluster (cb-example). Please delete its CR"
Here is the yaml that I used to create the operator -
This is a known issue for cloud and storage providers that have poor performance characteristics. At present, in Operator <=1.1.0, we have a timeout set to 5 minutes, which is evidently not long enough for IBM Cloud.
We have a fix planned for Operator 1.2.0 (to be released early 2019) that will allow you to override the default timeout.
To my mind 5 minutes to create a persistent volume is somewhat excessive. I’d be interested to know IBM’s take on why this is taking so long. They may be able to offer some workarounds to improve performance in the short term and allow your deployment.
It’s a lot less than 5 minutes. Here’s the entire log -
time="2018-12-06T19:17:49Z" level=info msg="Janitor process starting" cluster-name=cb-example module=cluster
time="2018-12-06T19:17:49Z" level=info msg="Setting up client for operator communication with the cluster" cluster-name=cb-example module=cluster
time="2018-12-06T19:17:49Z" level=info msg="Cluster does not exist so the operator is attempting to create it" cluster-name=cb-example module=cluster
time="2018-12-06T19:17:49Z" level=info msg="Creating headless service for data nodes" cluster-name=cb-example module=cluster
time="2018-12-06T19:17:49Z" level=info msg="Creating NodePort UI service (cb-example-ui) for data nodes" cluster-name=cb-example module=cluster
time="2018-12-06T19:17:49Z" level=info msg="Creating a pod (cb-example-0000) running Couchbase enterprise-5.5.1" cluster-name=cb-example module=cluster
time="2018-12-06T19:19:49Z" level=info msg="deleted pod (cb-example-0000)" cluster-name=cb-example module=cluster
time="2018-12-06T19:19:49Z" level=error msg="Cluster setup failed: fail to create member's pod (cb-example-0000): failed to create persistent volume claim: context deadline exceeded for pvc-couchbase-cb-example-0000-00-index" cluster-name=cb-example module=cluster
time="2018-12-06T19:19:49Z" level=warning msg="Fail to handle event: ignore failed cluster (cb-example). Please delete its CR"
Looking at the logs it times out after exactly 2 minutes. Is this parameter configurable?
@raju Thanks for your response. I don’t think I have permission to create this issue. Could you raise this issue or help me get the required permissions?
I see a new 1.2-DP operator image. Has this parameter been made configurable yet? I don’t have access to documentation for 1.2. It still times out after 2 minutes.
Very observant! That image is for a developer preview release, so the documentation is not public yet. To answer your question: yes, we have addressed your issue.
The 1.2.0 GA is scheduled for release in approximately a month. Keep an eye on our blog and we’ll link to all the documentation and download resources when the time comes.
@simon.murray
Thank you so much for replying. We are really excited to hear this.
Our problem is that we have a release coming up this month and we need Couchbase set up by the end of this week.
If it’s already a part of the DP build, could you send me the parameter name? (I’ll modify the CRDs accordingly)
If there’s any other way that I could get the fix or if you have any alternate workaround that would be great too.
I’ll do my best to help you achieve your milestone then.
First, in your operator deployment, add --pod-create-timeout=10m as an argument. It will accept anything that Go’s time.ParseDuration() can parse.
Second, there are a few new attributes in the CouchbaseCluster resource that need to be filled in. Sane defaults are:
Thank you @simon.murray. I really appreciate you helping me with this.
I’ve tried deploying it with the new changes, but I’m not sure what I’m doing wrong. It still times out after 2 minutes.
I’m attaching the operator and Couchbase cluster YAML files along with the operator logs. Could you take a look at them and let me know if you find something?
time="2019-02-20T13:21:51Z" level=info msg="Obtaining resource lock" module=main
time="2019-02-20T13:21:51Z" level=info msg="Starting event recorder" module=main
time="2019-02-20T13:21:51Z" level=info msg="Attempting to be elected the couchbase-operator leader" module=main
time="2019-02-20T13:22:08Z" level=info msg="I'm the leader, attempt to start the operator" module=main
time="2019-02-20T13:22:08Z" level=info msg="Creating the couchbase-operator controller" module=main
time="2019-02-20T13:22:08Z" level=info msg="Event(v1.ObjectReference{Kind:\"Endpoints\", Namespace:\"test-op\", Name:\"couchbase-operator\", UID:\"5604a51f-eef4-11e8-81b6-96f8cfb4c54c\", APIVersion:\"v1\", ResourceVersion:\"64054135\", FieldPath:\"\"}): type: 'Normal' reason: 'LeaderElection' couchbase-operator-6ff9589d49-gt7hm became leader" module=event_recorder
time="2019-02-20T13:22:08Z" level=info msg="CRD initialized, listening for events..." module=controller
time="2019-02-20T13:23:46Z" level=info msg="Watching new cluster" cluster-name=cb-example module=cluster
time="2019-02-20T13:23:46Z" level=info msg="Janitor process starting" cluster-name=cb-example module=cluster
time="2019-02-20T13:23:46Z" level=info msg="Setting up client for operator communication with the cluster" cluster-name=cb-example module=cluster
time="2019-02-20T13:23:46Z" level=info msg="Cluster does not exist so the operator is attempting to create it" cluster-name=cb-example module=cluster
time="2019-02-20T13:23:46Z" level=info msg="Creating headless service for data nodes" cluster-name=cb-example module=cluster
time="2019-02-20T13:23:46Z" level=info msg="Created service cb-example-ui for admin console" cluster-name=cb-example module=cluster
time="2019-02-20T13:23:46Z" level=info msg="Creating a pod (cb-example-0000) running Couchbase 6.0.1" cluster-name=cb-example module=cluster
time="2019-02-20T13:25:47Z" level=info msg="deleted pod (cb-example-0000)" cluster-name=cb-example module=cluster
time="2019-02-20T13:25:47Z" level=error msg="Cluster setup failed: fail to create member's pod (cb-example-0000): failed to create persistent volume claim: context deadline exceeded for cb-example-0000-default-00" cluster-name=cb-example module=cluster
time="2019-02-20T13:25:47Z" level=warning msg="Fail to handle event: ignore failed cluster (cb-example). Please delete its CR"
I apologize for the length of the post. I do not have permissions to attach files.
No worries, it shows me all I need to know. I see your problem: it seems the PVC wait code is hard-coded to 2 minutes and doesn’t honor the global pod creation timeout. I’ll raise a defect and get it fixed straight away. I’ll have a chat with our project management team to see if there’s anything we can do to help you by Friday.
As Simon mentioned, we are working on fixing that issue in the 1.2 release. Just wanted to let you know that 1.2 GA is tentatively planned for the April-May timeframe. I would not recommend using a DP version for your release; I would wait for the final GA version instead.
Could you please contact me at anil@couchbase.com? I can assist by giving you an early drop with the fix for testing purposes.