Starting with an up-and-running 2-node cluster of CB 7.1.3 EE on CentOS 7:
[rgr@cb7-a ~]# /opt/couchbase/bin/couchbase-cli server-list -c 127.0.0.1 -u Administrator -p admin123
ns_1@192.168.99.151 192.168.99.151:8091 healthy active
ns_1@cb7-a.infra.somewhere.com cb7-a.infra.somewhere.com:8091 healthy active
[rgr@cb7-a ~]# /opt/couchbase/bin/couchbase-cli bucket-list -c 127.0.0.1 -u Administrator -p admin123
conv_session_info
bucketType: membase
numReplicas: 1
ramQuota: 536870912
ramUsed: 331122208
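For reference, the same node state can also be checked via the REST API; something like this (assuming jq is available) prints health and membership per node:

curl -s -u Administrator:admin123 http://127.0.0.1:8091/pools/default | \
  jq '.nodes[] | {hostname, status, clusterMembership}'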
After simulating a failed node/server by shutting it down, its state changes to unhealthy, as expected:
(no auto-failover or similar is in place for this test)
[rgr@cb7-a ~]# /opt/couchbase/bin/couchbase-cli server-list -c 127.0.0.1 -u Administrator -p admin123
ns_1@192.168.99.151 192.168.99.151:8091 unhealthy active
ns_1@cb7-a.infra.somewhere.com cb7-a.infra.somewhere.com:8091 healthy active
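Stopping just the Couchbase service on that node (instead of shutting down the whole server) should produce the same unhealthy state, e.g.:

sudo systemctl stop couchbase-server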
Now I want to force a failover, so that the replica items on the remaining server are activated:
(the curl command is trimmed a bit, as posting it in full is not allowed here)
curl /controller/failOver -d 'otpNode=ns_1@192.168.99.151'
HTTP/1.1 504 Gateway Time-out
Bummer, it fails (as the node can’t be reached), so I try harder:
curl /controller/failOver -d 'otpNode=ns_1@192.168.99.151' -d allowUnsafe=true
HTTP/1.1 200 OK
That worked, but now the 2nd node is gone from the cluster:
[rgr@cb7-a ~]# /opt/couchbase/bin/couchbase-cli server-list -c 127.0.0.1 -u Administrator -p admin123
ns_1@cb7-a.infra.somewhere.com cb7-a.infra.somewhere.com:8091 healthy active
whereas with CB up to 6.6 it would remain in the cluster as ‘unhealthy inactiveFailed’.
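For comparison, a hard failover can also be issued via couchbase-cli, though I’m not aware of a CLI counterpart of allowUnsafe (hence the raw REST call above):

/opt/couchbase/bin/couchbase-cli failover -c 127.0.0.1 -u Administrator -p admin123 \
  --server-failover 192.168.99.151:8091 --hard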
Since the node has been removed from the cluster, it can’t be added back via recovery after it has been started again:
[rgr@cb7-a ~]# /opt/couchbase/bin/couchbase-cli recovery -c 127.0.0.1 -u Administrator -p admin123 --server-recovery 192.168.99.151
ERROR: Server not found 192.168.99.151:8091
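If the 7.x behaviour is indeed to eject the node, then presumably the only way back is to treat it as a brand-new node: wipe/re-initialize it first (it still carries the old cluster config), then re-add it and rebalance, roughly:

/opt/couchbase/bin/couchbase-cli server-add -c 127.0.0.1 -u Administrator -p admin123 \
  --server-add 192.168.99.151:8091 --server-add-username Administrator \
  --server-add-password admin123 --services data
/opt/couchbase/bin/couchbase-cli rebalance -c 127.0.0.1 -u Administrator -p admin123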