CPU usage stuck at 100%

We have a 3-node environment.
As of this morning, I noticed one of the nodes is now in ‘pending’ state and its CPU usage is stuck at 100%. I have noticed a large increase in documents for one bucket in particular. What I’m interested in, though, is how to ‘reset’ or ‘clear’ the CPU activity for the node stuck on pending.

Looking through the Couchbase UI, the only thing I noticed is that the node in question has almost 20k items in the large bucket’s DCP queue. I’m not sure if that’s relevant.

I’ve tried failing over the node (if the numbers shown to me in the UI are correct, the other 2 nodes should be able to handle the load after a rebalance), but it fails every single time,

originally with this error in the logs:
```
Rebalance exited with reason {unexpected_exit,
{'EXIT',<0.4710.460>,
{bulk_set_vbucket_state_failed, …
```

and now every time with this error:
```
Rebalance exited with reason {{badmatch,
{errors, …
```

So, does anyone know how I could clear up that CPU usage and get my cluster back on track?

Hi,

Sorry that your cluster got into such a state. I have a few suggestions, but I want to make sure we proceed cautiously here.

  1. Which version of CB are you using? CE or EE?
  2. Which OS are you using?
  3. Is this a prod cluster or a test cluster? Do you have replicas?

The first thing I want to make sure of is that there is no data loss because of the node going into the pending state. If needed, please back up the data on that node.
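
For example, if you want to take a copy before touching anything, the bundled cbbackup tool can do it. The host, credentials, backup path and bucket name below are just placeholders, so adjust them for your setup (Enterprise Edition 4.5+ also ships cbbackupmgr):

```sh
# Back up all buckets from this node to a local directory (placeholder paths/credentials)
/opt/couchbase/bin/cbbackup http://127.0.0.1:8091 /backups/node1 \
    -u Administrator -p password

# Or limit the backup to the large bucket only (replace "mybucket" with the real name)
/opt/couchbase/bin/cbbackup http://127.0.0.1:8091 /backups/node1 \
    -u Administrator -p password -b mybucket
```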

Second, I would try to bring the node from the pending state back into a normal state. A couple of things I would suggest:

  1. Restart the CB process (again, please make sure to back up first if needed); a sketch of the commands is below this list.
  2. Add more CPU to that node, if possible.
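
On Ubuntu, restarting the service is usually just one of the following, depending on whether the box uses systemd or the older init scripts:

```sh
# systemd-based Ubuntu (15.04 and later)
sudo systemctl restart couchbase-server

# older init/upstart-based Ubuntu
sudo service couchbase-server restart
```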

Once we can bring that node back into a normal state, we can look into how to distribute the load. I would recommend adding an additional node and then trying a rebalance.
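
If you prefer the command line over the UI, adding a node and rebalancing looks roughly like this with couchbase-cli. The hostnames and credentials are placeholders, and the flag spelling varies a bit between versions, so check `couchbase-cli server-add --help` on your install first:

```sh
# Add the new node to the cluster (placeholder addresses/credentials)
/opt/couchbase/bin/couchbase-cli server-add -c existing-node:8091 \
    -u Administrator -p password \
    --server-add new-node:8091 \
    --server-add-username Administrator --server-add-password password

# Then rebalance the data across all nodes
/opt/couchbase/bin/couchbase-cli rebalance -c existing-node:8091 \
    -u Administrator -p password
```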

Thanks,
Qi

We ended up resolving the issue through a backup plus some config changes to the Ubuntu installation (transparent hugepages, swappiness).
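
For anyone who hits the same thing, the changes were along these lines. The paths are the standard ones on Ubuntu; the exact swappiness value is our own choice, the Couchbase docs just recommend keeping it low:

```sh
# Disable transparent hugepages (re-apply at boot, e.g. via rc.local or a systemd unit)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

# Lower swappiness so the kernel avoids swapping out Couchbase memory
sudo sysctl vm.swappiness=1
echo 'vm.swappiness = 1' | sudo tee -a /etc/sysctl.conf
```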

We pretty much knew why it was happening (insufficient hardware specs), but we were also doing it on purpose to simulate problems and have our maintenance team figure out ways to fix them.

What we have learned from all this is to make sure to scale up before it becomes an issue, because once that problem is encountered, the whole cluster seems to go into limbo; we did not find any quick and satisfying way to get it back on track. :smiley:

Thanks for the help though