Trying to recover from an outage, rebalancing fails immediately

We had a major outage where two out of three nodes were offline for a while. Once those servers were back online we had further problems, because the single remaining node was running out of disk. We eventually got more disk added, which brought the cluster back to serving data, but none of the nodes are healthy yet. Whenever we try to rebalance, all we get is:
Rebalance exited with reason {badmatch,
{error,{failed_nodes,['ns_1@-.-.-.14']}}}

We are running Community Edition 6.0.0 build 1693 on all the nodes.
I have done a cbcollect_info that I can share but I don’t want to attach it here.

We have had issues with this cluster before, but never like this, where the cluster simply refuses to even attempt a rebalance. I can ping between the nodes with low latency, and nc reports open ports on 8091-8094 and 9100-9105.
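In case it helps anyone reproduce the connectivity check, this is roughly what the port sweep amounts to, sketched with Python's socket module (the hostname is a placeholder for one of our redacted node addresses; the port ranges are the ones listed above):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def sweep(host, ports):
    """Probe each port and report whether it accepted a connection."""
    return {port: port_open(host, port) for port in ports}

if __name__ == "__main__":
    # Placeholder hostname; the REST/UI and views ports (8091-8094)
    # plus the indexer ports (9100-9105) that we checked with nc.
    ports = list(range(8091, 8095)) + list(range(9100, 9106))
    for port, ok in sweep("node14.example.internal", ports).items():
        print(port, "open" if ok else "closed")
```

All ports came back open for us, which is why the failed_nodes error was so confusing.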

Would be grateful for any help regarding this.

I found this in reports.log:
exception exit: {{badmatch,
{error,
{setup_replications_failed,
[{'ns_1@-.-.-.12',
{errors,
[{34,999},
{34,936},
{34,919},
{34,823},
{34,320}

And the memcached logs have a lot of closed-stream messages:
2023-06-12T06:57:24.761894+02:00 INFO 875: (Catalog) DCP (Consumer) eq_dcpq:replication:ns_1@-.-.-.12->ns_1@-.-.-.14:Catalog - (vb 1019) Setting stream to dead state, last_seqno is 102814, unAckedBytes is 0, status is The stream closed due to a close stream message.

Do you have a backup of the data that you could reload? Did any of the nodes get failed over?


We will attempt to restore the entire cluster from a backup. Thank you.
