Trying to recover from an outage, rebalancing fails immediately

We had a major outage where two out of three nodes were offline for a while. Once those servers were back online we had further problems, because the single remaining node was running out of disk. We eventually got more disk added, which brought the cluster back to serving data, but none of the nodes are healthy yet. Whenever we try to rebalance, all we get is:
Rebalance exited with reason {badmatch,
{error,{failed_nodes,['ns_1@-.-.-.14']}}}

We are running Community Edition 6.0.0 build 1693 on all the nodes.
I have done a cbcollect_info that I can share but I don’t want to attach it here.

We have had issues with this cluster before, but never like this, where the cluster simply refuses to even attempt a rebalance. I can ping between the nodes with low latency, and nc reports open ports on 8091-8094 and 9100-9105.
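In case it helps anyone reproduce the connectivity check, this is roughly what the port sweep amounts to, sketched with Python's socket module (the hostname is a placeholder for one of our redacted node addresses; the port ranges are the ones listed above):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def sweep(host, ports):
    """Probe each port and report whether it accepted a connection."""
    return {port: port_open(host, port) for port in ports}

if __name__ == "__main__":
    # Placeholder hostname; the REST/UI and views ports (8091-8094)
    # plus the indexer ports (9100-9105) that we checked with nc.
    ports = list(range(8091, 8095)) + list(range(9100, 9106))
    for port, ok in sweep("node14.example.internal", ports).items():
        print(port, "open" if ok else "closed")
```

All ports came back open for us, which is why the failed_nodes error was so confusing.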

Would be grateful for any help regarding this.

I found this in reports.log:
exception exit: {{badmatch,
{error,
{setup_replications_failed,
[{'ns_1@-.-.-.12',
{errors,
[{34,999},
{34,936},
{34,919},
{34,823},
{34,320}

And the memcached logs have a lot of closed-stream messages:
2023-06-12T06:57:24.761894+02:00 INFO 875: (Catalog) DCP (Consumer) eq_dcpq:replication:ns_1@-.-.-.12->ns_1@-.-.-.14:Catalog - (vb 1019) Setting stream to dead state, last_seqno is 102814, unAckedBytes is 0, status is The stream closed due to a close stream message.

Do you have a backup of the data that you could reload? Did any of the nodes get failed over?


We will attempt to restore the entire cluster from a backup. Thank you.
