Couchbase 6.0 Node down

Hi,

We are doing some tests with a multi-node configuration and we are not able to understand the implemented logic.

In a 2-node configuration with 1 bucket and 1 replica we can see the total number of items/documents in each node (active + replica). Interestingly, the split is not exactly 50/50 in some situations, but it is fair.

In this configuration, if we kill one node (systemctl stop couchbase-server.service), no operation can be performed against the node that is still alive; all operations return an error:

{
  "code": 12008,
  "msg": "Error performing bulk get operation - cause: {1 errors, starting with dial tcp 10.17.11.202:11210: getsockopt: connection refused}"
}

Why is the query (run from the GUI) trying to connect to the dead node?

For the 3-node configuration we have a similar issue: if we kill one node, the cluster doesn't respond successfully to any operation (query, insert, etc.) - same error as above - until the dead node is failed over.
Is this really the expected behaviour? We know we can configure auto-failover down to 5 seconds for a cluster with more than 2 nodes (with 2 nodes it seems no failover is performed). But still, 5 seconds of a non-responsive DB sounds like a problem for us.
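For reference, here is a rough sketch of setting that timeout through the cluster manager's /settings/autoFailover REST endpoint (the host, port 8091 address and the Administrator credentials below are placeholders; the same setting is also exposed in the GUI and couchbase-cli):

package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

func main() {
	// Enable auto-failover with the minimum allowed timeout of 5 seconds.
	form := url.Values{}
	form.Set("enabled", "true")
	form.Set("timeout", "5")

	req, err := http.NewRequest("POST",
		"http://10.17.11.201:8091/settings/autoFailover", // placeholder cluster node
		strings.NewReader(form.Encode()))
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth("Administrator", "password") // placeholder credentials
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status) // expect 200 OK if the setting was accepted
}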

For a 2-node cluster, it seems only manual intervention - a manual failover of the dead node - lets us use the DB again… It is odd, though, that when we try to do that a prompt tells us we are going to lose data! Why? The other node should have all the vBuckets (active + replica).

Are we missing something?

Thanks


Doing some other tests, we see that the statement that replicas can be used for READs when one node is down does not hold.

In a bucket with 10 documents on 3 nodes with 1 replica, if one node goes down, some of the documents are not retrievable by key. In our test, 2 out of 10 returned "Internal Server Error" using the GUI Documents menu.
In this scenario, SQL queries always fail.

Only when failover is executed are all documents accessible. Why not allow reading from the replica? I can understand preventing writes…
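For completeness, the SDK key-value API does offer explicit replica reads (GetAnyReplica in the Go SDK), which may be what the "read from replica" statement refers to. A minimal sketch, assuming the Go SDK (gocb v2); the cluster address, bucket name, document key and credentials are placeholders:

package main

import (
	"fmt"
	"time"

	"github.com/couchbase/gocb/v2"
)

func main() {
	cluster, err := gocb.Connect("couchbase://10.17.11.201", gocb.ClusterOptions{
		Username: "Administrator", // placeholder credentials
		Password: "password",
	})
	if err != nil {
		panic(err)
	}
	bucket := cluster.Bucket("test-bucket") // placeholder bucket name
	if err := bucket.WaitUntilReady(5*time.Second, nil); err != nil {
		panic(err)
	}
	collection := bucket.DefaultCollection()

	// GetAnyReplica asks the active copy and all replicas and returns the
	// first one that answers, so it can still serve a read while the node
	// holding the active vBucket is down and not yet failed over.
	res, err := collection.GetAnyReplica("doc-key-1", nil) // placeholder key
	if err != nil {
		panic(err)
	}
	var doc interface{}
	if err := res.Content(&doc); err != nil {
		panic(err)
	}
	fmt.Printf("got document (isReplica=%v): %v\n", res.IsReplica(), doc)
}

This only covers key-value gets, though; it would not change how the GUI document viewer or the SQL queries behave while the node is down.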

Waiting for the failover is a real problem for high availability. Setting it to the minimum failover time will drop nodes all the time, and using a high value will leave the cluster almost completely unavailable in the meantime.

Writing is only available for the vBuckets active on the other nodes; writing a key whose vBucket is on the downed node returns an error. We understand the hashing of keys to vBuckets, but if a node goes down… the option to write to another node should be possible. After the node comes back, or after a failover, the cluster could rebalance the vBuckets with the new inserts…

Summarising: it seems that when a node goes down, the cluster is mostly unusable.

Or are we missing something?

Hi willthrom. Were you able to find answers to your questions?
I recently faced the exact same issue with 2 nodes during our failover tests on Couchbase 6.5. We also tried with 3 nodes.
What is the point of replication if data cannot be queried from the working node?
We receive "Error performing bulk get operation - cause: unable to complete action after 6 attempts" when the SDK queries the working node, until the auto-failover happens.
How can this be avoided?
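In the meantime, for plain key-value reads (not the query path that produces the bulk-get error above) we are looking at falling back to a replica read when the normal get fails. A rough sketch, assuming the Go SDK (gocb v2); the helper name and the short timeout are our own placeholders:

package kvfallback

import (
	"time"

	"github.com/couchbase/gocb/v2"
)

// getWithReplicaFallback (placeholder name) tries the normal get with a short
// timeout and, if the active node does not answer (for example because it is
// down but not yet failed over), falls back to reading any available copy.
func getWithReplicaFallback(col *gocb.Collection, key string, out interface{}) error {
	res, err := col.Get(key, &gocb.GetOptions{Timeout: 500 * time.Millisecond})
	if err == nil {
		return res.Content(out)
	}
	// Active copy unreachable: take whichever copy (active or replica) answers first.
	rep, repErr := col.GetAnyReplica(key, nil)
	if repErr != nil {
		return repErr
	}
	return rep.Content(out)
}

This still does nothing for the N1QL queries, so the question stands.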