We had a major outage in which two of our three nodes were offline for a while. Once the servers were back online we had major issues because the one node that had stayed up was running out of disk. We eventually got more disk added, which got the cluster back to serving data, but none of the nodes are healthy yet. When we try to rebalance, all we get is:
Rebalance exited with reason {badmatch,
{error,{failed_nodes,['ns_1@-.-.-.14']}}}
We are running Community Edition 6.0.0 build 1693 on all the nodes.
I have run cbcollect_info and can share the output, but I don’t want to attach it here.
We have had issues with the cluster before, but never like this, where the cluster simply refuses to attempt a rebalance. I can ping between the nodes with low latency, and nc reports open ports on 8091-8094 and 9100-9105.
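For anyone who wants to repeat the port check from each node, here is a rough Python equivalent of the nc test. It is only a sketch: the function name `closed_ports` is made up for illustration, the port list is just the two ranges mentioned above, and a successful TCP connect only shows the port is reachable, not that the service behind it is healthy.

```python
import socket

# Ports taken from the nc check described above (8091-8094 and 9100-9105).
PORTS = list(range(8091, 8095)) + list(range(9100, 9106))

def closed_ports(host, ports=PORTS, timeout=2.0):
    """Return the ports on `host` that do not accept a TCP connection."""
    closed = []
    for port in ports:
        try:
            # Open and immediately close a TCP connection to the port.
            with socket.create_connection((host, port), timeout=timeout):
                pass
        except OSError:
            # Refused, timed out, or unreachable: treat the port as closed.
            closed.append(port)
    return closed
```

Calling `closed_ports(host)` with each of the other nodes' addresses, from every node in turn, should return an empty list if the ports are open in both directions.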
Would be grateful for any help regarding this.
I found this in reports.log:
exception exit: {{badmatch,
{error,
{setup_replications_failed,
[{'ns_1@-.-.-.12',
{errors,
[{34,999},
{34,936},
{34,919},
{34,823},
{34,320}
And the memcached logs have a lot of closed-stream messages:
2023-06-12T06:57:24.761894+02:00 INFO 875: (Catalog) DCP (Consumer) eq_dcpq:replication:ns_1@-.-.-.12->ns_1@-.-.-.14:Catalog - (vb 1019) Setting stream to dead state, last_seqno is 102814, unAckedBytes is 0, status is The stream closed due to a close stream message.
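To get a feel for how many vBuckets are affected, the dead-stream messages can be tallied with a short script. This is a sketch only: the regular expression is inferred from the single log line above, so other message variants or other server versions may not match, and `summarize_dead_streams` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern inferred from the one log line shown above (an assumption).
DEAD_STREAM = re.compile(
    r"\(vb (?P<vb>\d+)\) Setting stream to dead state, "
    r"last_seqno is (?P<seqno>\d+), unAckedBytes is (?P<unacked>\d+), "
    r"status is (?P<status>.*)$"
)

def summarize_dead_streams(lines):
    """Count dead-stream messages per status and collect the vBucket ids."""
    by_status = Counter()
    vbuckets = set()
    for line in lines:
        match = DEAD_STREAM.search(line)
        if match:
            by_status[match.group("status").strip()] += 1
            vbuckets.add(int(match.group("vb")))
    return by_status, sorted(vbuckets)
```

Passing an open log file (any iterable of lines) to `summarize_dead_streams` returns the count per close reason and the sorted list of vBucket ids that appear in those messages.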