HA question: what happens when 2 nodes go down

eugene-rotar · October 23, 2022, 1:28am

Hello!

I have a question about the HA (high availability) of a cluster when 2 nodes go down at the same (or almost the same) time.
We’re testing a scenario with an outage in one AWS AZ (availability zone). With a cluster whose nodes are distributed evenly across several AZs, losing one AZ can take down more than one node at once.

Prerequisite:

Multi-node cluster with 5 nodes, CE 7.1.1, 1 bucket with 1 replica. Data, index and query services are running on every node

Scenario:

We’re losing AZ-3, meaning both Node-4 and Node-5 become unresponsive

1. Couchbase fails one of the nodes over (Node-4, for example)
2. Couchbase reports that the auto failover maximum has been reached
3. The second failed node (Node-5) is unresponsive, but it is still part of the cluster
4. We try to run a read query, but get the following error:

[
  {
    "code": 12008,
    "msg": "Error performing bulk get operation  - cause: {1 errors, starting with dial tcp IP5:11210: connect: no route to host}",
    "retry": true
  }
]

Where IP5 is the IP of Node-5.
5. From this point manual intervention is needed (resetting the auto failover quota or manually hard failing Node-5 over), otherwise we can’t read data from the cluster
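
For reference, the two manual steps named in step 5 can be scripted against the cluster manager's REST interface. This is a minimal sketch that only builds the requests and does not send them. It assumes the cluster manager is reachable on port 8091 and that Node-5 is registered under an `otpNode` name of the form `ns_1@<address>`; the host name, credentials and helper function names are placeholders. The endpoint paths `/settings/autoFailover/resetCount` and `/controller/failOver` should be checked against the REST reference for your server version before use.

```python
import base64
import urllib.parse
import urllib.request


def build_request(base_url, path, user, password, form=None):
    """Build (but do not send) an authenticated POST to the cluster manager."""
    data = urllib.parse.urlencode(form or {}).encode()
    req = urllib.request.Request(base_url + path, data=data, method="POST")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    return req


def reset_auto_failover_count(base_url, user, password):
    # Resets the auto failover counter so another automatic failover can occur.
    return build_request(base_url, "/settings/autoFailover/resetCount", user, password)


def hard_failover(base_url, user, password, otp_node):
    # Requests a hard failover of the given node (assumed otpNode naming).
    return build_request(base_url, "/controller/failOver", user, password,
                         {"otpNode": otp_node})
```

To actually send one of these, pass the returned object to `urllib.request.urlopen` from a host that can reach a healthy node, for example `urllib.request.urlopen(hard_failover("http://node1:8091", "Administrator", "password", "ns_1@IP5"))`.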

Question 1:

Is there any way to increase the auto failover maximum to 2-3 nodes, or any other way to handle this case automatically?
Otherwise it looks like it doesn’t make sense to have more than 3 nodes in a cluster spread across 3 AZs (assuming read downtime longer than 2-3 minutes is critical for us).
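
The arithmetic behind this concern can be sketched as follows. The function name is illustrative, and the sketch assumes nodes are spread as evenly as possible across AZs and that only one node can be failed over automatically, as in the scenario above.

```python
import math


def worst_case_nodes_lost(nodes: int, zones: int) -> int:
    """Largest number of nodes in any single AZ under an even spread."""
    return math.ceil(nodes / zones)


AUTO_FAILOVER_MAX = 1  # assumption: the limit observed in the scenario above

for nodes in (3, 5, 6):
    lost = worst_case_nodes_lost(nodes, zones=3)
    covered = lost <= AUTO_FAILOVER_MAX
    print(f"{nodes} nodes / 3 AZs: up to {lost} lost, covered automatically: {covered}")
```

With 3 AZs, any cluster larger than 3 nodes puts at least 2 nodes in one AZ, so a single AZ outage exceeds a limit of 1.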

Question 2:

What are the recommendations or best practices for cases like this?

Thank you!

vsr1 · October 25, 2022, 12:20am

https://docs.couchbase.com/operator/current/concept-server-groups.html

eugene-rotar · October 26, 2022, 2:42pm

Thanks! That’s exactly what I was looking for!
