Manually failed-over node removed from Cluster

Hello all,

We had a Cluster with 3 nodes running Couchbase CE 7.1.0 on AWS EC2, with a replication factor of 2.

After the disk on node-3 filled up and it was auto failed-over, our node-1 (the master) went down as well some hours later, though not because of a full disk:

node-1: IP ending with 124
node-2: IP ending with 125
node-3: IP ending with 51

We increased node-3’s EBS volume in an attempt to recover the cluster’s normal state.

When we tried to REBALANCE, nothing happened. We then tried to manually FAILOVER node-1; a message came up saying that this particular node was not responsive and asked if we wanted to continue with the process. After more failed REBALANCE attempts, we allowed the FAILOVER of node-1.

After node-1 was ‘successfully’ failed-over, we were expecting the node to remain in the Cluster in a failed-over state so that it could be RECOVERED, as happened with node-3. That didn’t happen: node-1 was removed from the Cluster.

The last LOG emitted from node-1 was:

Service 'memcached' exited with status 137. Restarting. Messages:
W0608 19:59:10.099933 1160 HazptrDomain.h:670] Using the default inline executor for asynchronous reclamation may be susceptible to deadlock if the current thread happens to hold a resource needed by the deleter of a reclaimable object
W0616 14:59:29.169379 1160 HazptrDomain.h:670] Using the default inline executor for asynchronous reclamation may be susceptible to deadlock if the current thread happens to hold a resource needed by the deleter of a reclaimable object
W0623 17:35:11.380251 1150 HazptrDomain.h:670] Using the default inline executor for asynchronous reclamation may be susceptible to deadlock if the current thread happens to hold a resource needed by the deleter of a reclaimable object
W0629 11:08:37.452235 1161 HazptrDomain.h:670] Using the default inline executor for asynchronous reclamation may be susceptible to deadlock if the current thread happens to hold a resource needed by the deleter of a reclaimable object
W0702 04:47:49.334327 1161 HazptrDomain.h:670] Using the default inline executor for asynchronous reclamation may be susceptible to deadlock if the current thread happens to hold a resource needed by the deleter of a reclaimable object

And about 4 seconds later, on node-2:

Node 'ns_1@XXX.XXX.XXX.125' saw that node 'ns_1@XXX.XXX.XXX.124' went down. Details: [{nodedown_reason,
connection_closed}]
Haven't heard from a higher priority node or a master, so I'm taking over.
Enabled auto-failover with timeout 120 and max count 1
Could not auto-failover node ('ns_1@XXX.XXX.XXX.124'). Number of remaining nodes that are running data service is 1. You need at least 2 nodes.

We’d like to know if there is any way we could recover node-1 and bring it back into the Cluster, or whether our only way out is to “ADD SERVER” node-1 again.

I think it is important to say that the WEB UI on node-1 is not accessible anymore.

Any help is more than welcome.

I’d just like to add that today our node-2 (master) died too, apparently for the same reason as node-1:

Service 'memcached' exited with status 137. Restarting. Messages:
[*** LOG ERROR #821438 ***] [2023-06-26 23:59:43] [spdlog_file_logger] {Failed writing to file /opt/couchbase/var/lib/couchbase/logs/memcached.log.000248.txt: No space left on device}
[*** LOG ERROR #821439 ***] [2023-06-26 23:59:56] [spdlog_file_logger] {Failed writing to file /opt/couchbase/var/lib/couchbase/logs/memcached.log.000248.txt: No space left on device}
W0627 00:25:49.059343 1145 HazptrDomain.h:670] Using the default inline executor for asynchronous reclamation may be susceptible to deadlock if the current thread happens to hold a resource needed by the deleter of a reclaimable object
W0701 09:01:57.970530 1146 HazptrDomain.h:670] Using the default inline executor for asynchronous reclamation may be susceptible to deadlock if the current thread happens to hold a resource needed by the deleter of a reclaimable object
W0703 15:12:51.502624 1145 HazptrDomain.h:670] Using the default inline executor for asynchronous reclamation may be susceptible to deadlock if the current thread happens to hold a resource needed by the deleter of a reclaimable object

I know there is a “No space left on device” message in the log, but I checked on the machine itself and absolutely no file system is anywhere near full.

Last night we managed to do a FULL RECOVERY of node-3 (the auto failed-over one), so now we are redirecting traffic to it. Node-3 is our last node; if it goes down too, I don’t know what else to do.

Hi @jose.venancio, sorry for the issues you’re having here. To set the right expectations, troubleshooting these sorts of issues can be pretty challenging without looking at the full set of logs for context. I’ll do my best to help…

You mentioned that you have 2 replicas, which means you can sustain the loss of two nodes and still have a full copy of the data. So in this case, once node-1 and node-2 are failed over, all of the data resides on node-3 and your application should be functioning correctly? Any data on the nodes that were failed over is no longer needed, so you should bring them back into the cluster as brand-new nodes. I would suggest uninstalling/re-installing Couchbase on those two nodes, or even starting with new instances.

You should then be able to add them into the cluster (i.e. the single node that is still alive) and rebalance them in to return to a healthy state.
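If you’d rather script that than click through the UI, here’s a rough sketch of the add-and-rebalance step against the cluster-management REST API on port 8091. The hostnames, credentials and service list below are placeholders, so adapt them to your environment and double-check the endpoints against the REST docs for your version:

# A sketch, not a drop-in script: re-adding a wiped node and rebalancing via
# the cluster REST API. Hostnames, credentials and services are placeholders.
import requests

CLUSTER = "http://node-3.example.internal:8091"  # any node still in the cluster
AUTH = ("Administrator", "password")             # cluster admin credentials

# 1. Add the freshly reinstalled node back into the cluster.
requests.post(
    f"{CLUSTER}/controller/addNode",
    auth=AUTH,
    data={
        "hostname": "node-1.example.internal",   # the reinstalled node
        "user": "Administrator",
        "password": "password",
        "services": "kv,n1ql,index",             # match your original topology
    },
).raise_for_status()

# 2. Collect the otpNode names ("ns_1@...") the cluster now knows about.
nodes = requests.get(f"{CLUSTER}/pools/default", auth=AUTH).json()["nodes"]
known = ",".join(n["otpNode"] for n in nodes)

# 3. Kick off a rebalance keeping every known node and ejecting nothing.
requests.post(
    f"{CLUSTER}/controller/rebalance",
    auth=AUTH,
    data={"knownNodes": known, "ejectedNodes": ""},
).raise_for_status()
print("Rebalance started; watch it in the UI or via /pools/default/rebalanceProgress")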

As a best practice, we recommend separating the disks (or disk partitions) between the installation directory (i.e. /opt/couchbase/xxx) and the data directory. By default they are in the same place, but you can configure the data directory separately for each node when adding it to the cluster. That way, even if the data grows and fills up the disk, it won’t impact the functioning and stability of the cluster.
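For reference, if you set this up before the node joins the cluster, the data and index paths can be pointed at the dedicated volume through the node-initialization endpoint. This is only a sketch with placeholder mount points and credentials, and it has to run before the node is added, since the paths can’t be changed once the node holds data:

# A sketch: point a not-yet-joined node at a dedicated data volume via the
# node-initialization settings endpoint. Mount points are placeholders.
import requests

NODE = "http://node-1.example.internal:8091"  # the new, standalone node
AUTH = ("Administrator", "password")

requests.post(
    f"{NODE}/nodes/self/controller/settings",
    auth=AUTH,
    data={
        "path": "/data/couchbase",                # data files on their own EBS volume
        "index_path": "/data/couchbase/indexes",  # keep index files off the root disk too
    },
).raise_for_status()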

Hello @perry, that’s what we ended up doing. We cleaned up nodes 1 and 2, reinstalled Couchbase and added them back to the cluster. The major problem now is that we’ll have to recreate the indexes that lived on nodes 1 and 2 again, one by one.
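A rough sketch of one way to avoid retyping each definition by hand: pull the index metadata from system:indexes on the surviving node and generate CREATE INDEX statements from it. The hostname, credentials, query-service port (8093) and field names here are assumptions based on the docs, so treat it as a starting point rather than something battle-tested:

# A sketch: list online index definitions from system:indexes and print
# CREATE INDEX statements to replay once the rebuilt nodes are back in.
import requests

QUERY = "http://node-3.example.internal:8093/query/service"  # query service
AUTH = ("Administrator", "password")

stmt = """
SELECT name, keyspace_id, index_key, `condition`, is_primary
FROM system:indexes
WHERE state = 'online'
"""
results = requests.post(QUERY, auth=AUTH, data={"statement": stmt}).json()["results"]

for idx in results:
    if idx.get("is_primary"):
        print(f"CREATE PRIMARY INDEX `{idx['name']}` ON `{idx['keyspace_id']}`;")
    else:
        keys = ", ".join(idx["index_key"])
        where = f" WHERE {idx['condition']}" if idx.get("condition") else ""
        print(f"CREATE INDEX `{idx['name']}` ON `{idx['keyspace_id']}`({keys}){where};")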

About that error that makes the node die: it happened again today. We saw the node’s memory usage increase until it was at about 98%, and then the node died. Thank God it wasn’t the master node (node-3), but node-2.

We were running those nodes on r5.2xlarge (64GiB RAM) EC2 instances, but now we are planning to change the rest of the nodes to r5.4xlarge (128GiB RAM); we already have one upgraded.

This is odd because it hadn’t happened before: this 3-node cluster (r5.2xlarge) ran for a year with no major problems. But it seems that after successive EBS volume increases (originally 460GiB, now 1TiB on each node), the problem started to happen (specifically after the increase to 1TiB).

Our memory quotas are:

[screenshot of the memory quotas: Screenshot_20230705_092040]

That is below the 90% recommendation (about 85%). If it helps, the memory increase happens only in the morning: the first day it happened at about 9am, the second day at about 8am, and today at about 7:30am (UTC-3).

Total disk used in the cluster is 1.17TiB.

all of the data resides on node-3 and your application should be functioning correctly?

That is correct; the problem is that node-3 almost died today too from the memory problem (usage went up to about 98% and then slowly decreased).

Hi @jose.venancio, glad to hear you were able to recover and get back to a stable state. These sorts of memory issues are notoriously difficult to diagnose. If you had a license for the Enterprise Edition, ideally we would open a support ticket and look more deeply into the logs.

If it does happen again, please try to observe which process is consuming the memory, which will hopefully give a better indication of what might be causing it.
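Something along these lines can help capture that the next time usage climbs: a small script that sums resident memory (VmRSS) per process name from /proc on the affected node. It’s only a sketch for a Linux host and isn’t Couchbase-specific:

# A sketch: sum resident memory (VmRSS) per process name from /proc so you can
# see which service owns the memory when usage spikes.
import os

def rss_by_process():
    usage = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/status") as f:
                fields = dict(line.split(":", 1) for line in f if ":" in line)
            name = fields["Name"].strip()
            rss_kb = int(fields.get("VmRSS", "0 kB").split()[0])
        except (OSError, KeyError, ValueError):
            continue  # process exited (or is unreadable) between listing and reading
        usage[name] = usage.get(name, 0) + rss_kb
    return usage

if __name__ == "__main__":
    # Top 10 process names by resident memory, in GiB.
    for name, kb in sorted(rss_by_process().items(), key=lambda kv: -kv[1])[:10]:
        print(f"{kb / 1024 / 1024:7.2f} GiB  {name}")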

Hey @perry, the process consuming the memory is (supposedly) cbq-engine, but that is just a guess, as the problem hasn’t happened again (after upgrading to 128GiB).

I searched for information on this process and it seems that other people have had a similar problem. Apparently it is caused by queries, but as we’ve been using Couchbase in production for a long time now, we have no idea which query it might be.

Thanks @jose.venancio. Yes, cbq-engine is the query processing engine. You can get some insight into the query service and the queries that are running (or have failed) here: Monitor Queries | Couchbase Docs.
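For example, you can poll the query service’s monitoring keyspaces directly: system:active_requests shows in-flight statements and system:completed_requests keeps a log of slow or failed ones. The sketch below assumes the default query port (8093), placeholder credentials, and the field names as I recall them from the monitoring docs, so adjust as needed:

# A sketch: poll the query monitoring keyspaces for running and recently
# completed statements. Port, credentials and field names are assumptions.
import requests

QUERY = "http://node-3.example.internal:8093/query/service"
AUTH = ("Administrator", "password")

def run(statement):
    resp = requests.post(QUERY, auth=AUTH, data={"statement": statement})
    resp.raise_for_status()
    return resp.json()["results"]

# Statements running right now.
for q in run("SELECT requestId, `statement`, elapsedTime, state FROM system:active_requests"):
    print("ACTIVE   ", q.get("elapsedTime"), q.get("statement", "<prepared>"))

# Recently finished requests the service kept because they were slow or failed.
for q in run(
    "SELECT requestTime, `statement`, elapsedTime, resultCount, errorCount "
    "FROM system:completed_requests ORDER BY requestTime DESC LIMIT 20"
):
    print("COMPLETED", q.get("requestTime"), q.get("elapsedTime"), q.get("statement", "<prepared>"))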

We’ve made a number of improvements to the query engine’s memory management over time. 7.1 is a pretty recent version so I’d expect you to already have most of that in place, but 7.2 was recently released and it’s always good to keep up with the latest version if you can.

If you do find this happening again, it’s generally safe to kill the cbq-engine process. It’s intended to be stateless and will re-spawn automatically. Any queries that were running at that moment would be cancelled in-flight but there shouldn’t be any other ill-effects and your application will continue functioning.
