Hello!
I have a question about HA (high availability) of the cluster when 2 nodes go down at the same (or almost the same) time.
We’re testing a scenario with an outage in one of the AWS AZs (availability zones). With a cluster whose nodes are distributed evenly across several AZs, losing one AZ is a problem because more than 1 node may be lost at once.
Prerequisite:
A multi-node cluster with 5 nodes, CE 7.1.1, 1 bucket with 1 replica. The Data, Index and Query services are running on every node.
Scenario:
We lose AZ-3, meaning both Node-4 and Node-5 become unresponsive:
1. Couchbase automatically fails over one of the nodes (Node-4, for example)
2. Couchbase reports that the auto failover maximum has been reached
3. The second failed node (Node-5) is unresponsive, but is still part of the cluster
4. We try to run a read query, but get the following error:
[
{
"code": 12008,
"msg": "Error performing bulk get operation - cause: {1 errors, starting with dial tcp IP5:11210: connect: no route to host}",
"retry": true
}
]
where IP5 is the IP address of Node-5
5. From this point on, manual intervention is needed (resetting the auto failover quota or manually hard failing over Node-5); otherwise we can’t read data from the cluster
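For reference, the manual intervention in step 5 could be scripted roughly as follows. This is only a minimal sketch: it assumes the cluster manager REST endpoints `/settings/autoFailover/resetCount` and `/controller/failOver` on port 8091 as I understand them from the Couchbase REST API docs (please verify against the 7.1.1 docs). The host name, credentials and `otpNode` value are placeholders, and the helper names are made up for illustration. The functions only build the requests; nothing is sent unless `urlopen` is called.

```python
import base64
import urllib.parse
import urllib.request

ADMIN_PORT = 8091  # assumed default cluster-management port


def _post(host, path, user, password, form=None):
    """Build (but do not send) a form-encoded POST with HTTP basic auth."""
    data = urllib.parse.urlencode(form or {}).encode()
    req = urllib.request.Request(
        f"http://{host}:{ADMIN_PORT}{path}", data=data, method="POST"
    )
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req


def reset_auto_failover_count(host, user, password):
    """Request that resets the auto failover counter (hypothetical helper)."""
    return _post(host, "/settings/autoFailover/resetCount", user, password)


def hard_failover(host, user, password, otp_node):
    """Request that hard fails over one node, e.g. otp_node='ns_1@IP5'."""
    return _post(host, "/controller/failOver", user, password,
                 {"otpNode": otp_node})


if __name__ == "__main__":
    # Placeholder host and credentials; point this at a healthy node.
    req = hard_failover("node1.example.com", "Administrator", "password",
                        "ns_1@IP5")
    print(req.get_method(), req.full_url, req.data.decode())
    # To actually send it: urllib.request.urlopen(req, timeout=10)
```

Running something like this from a monitoring job would still be our own automation around the cluster, which is exactly what we would like to avoid if Couchbase can handle the case itself.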
Question 1:
Is there any way to raise the auto failover limit to 2-3 nodes, or any other way to handle this case automatically?
Otherwise it looks like it doesn’t make sense to have more than 3 nodes in a cluster spread across 3 AZs (read downtime longer than 2-3 minutes is critical for us).
Question 2:
What are the recommendations or best practices for cases like this?
Thank you!