Hello all,
We have a 3-node cluster running Couchbase CE 7.1.0 on AWS EC2, with a replication factor of 2.
After the disk on node-3 filled up and the node was auto failed-over, our node-1 (the master) also went down some hours later, though not because of a full disk:
node-1: IP ending with 124
node-2: IP ending with 125
node-3: IP ending with 51
We increased node-3’s EBS volume in an attempt to recover the cluster’s normal state.
When we tried to REBALANCE, nothing happened. We then tried to manually FAILOVER node-1; a message came up saying that this node was not responsive and asked whether we wanted to continue anyway. After more failed REBALANCE attempts, we allowed the FAILOVER of node-1.
After node-1 was ‘successfully’ failed over, we expected it to remain in the cluster in a failed-over state so it could later be RECOVERED, as happened with node-3. That didn’t happen: node-1 was removed from the cluster entirely.
The last LOG emitted from node-1 was:
Service 'memcached' exited with status 137. Restarting. Messages:
W0608 19:59:10.099933 1160 HazptrDomain.h:670] Using the default inline executor for asynchronous reclamation may be susceptible to deadlock if the current thread happens to hold a resource needed by the deleter of a reclaimable object
W0616 14:59:29.169379 1160 HazptrDomain.h:670] Using the default inline executor for asynchronous reclamation may be susceptible to deadlock if the current thread happens to hold a resource needed by the deleter of a reclaimable object
W0623 17:35:11.380251 1150 HazptrDomain.h:670] Using the default inline executor for asynchronous reclamation may be susceptible to deadlock if the current thread happens to hold a resource needed by the deleter of a reclaimable object
W0629 11:08:37.452235 1161 HazptrDomain.h:670] Using the default inline executor for asynchronous reclamation may be susceptible to deadlock if the current thread happens to hold a resource needed by the deleter of a reclaimable object
W0702 04:47:49.334327 1161 HazptrDomain.h:670] Using the default inline executor for asynchronous reclamation may be susceptible to deadlock if the current thread happens to hold a resource needed by the deleter of a reclaimable object
And about 4 seconds later, on node-2:
Node 'ns_1@XXX.XXX.XXX.125' saw that node 'ns_1@XXX.XXX.XXX.124' went down. Details: [{nodedown_reason,
connection_closed}]
Haven't heard from a higher priority node or a master, so I'm taking over.
Enabled auto-failover with timeout 120 and max count 1
Could not auto-failover node ('ns_1@XXX.XXX.XXX.124'). Number of remaining nodes that are running data service is 1. You need at least 2 nodes.
We’d like to know whether there is any way to recover node-1 and bring it back into the cluster, or whether our only option is to “ADD SERVER” node-1 again.
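In case it helps anyone reading, this is roughly what we understand the “add it back” path to look like with couchbase-cli, as a sketch only. It assumes node-2 (the surviving master) is reachable on the default admin port 8091, that Couchbase Server on node-1 can be restarted with a clean state, and that the credentials shown are placeholders:

```shell
# Sketch, not verified against our cluster. Assumptions:
# - node-2 (ending .125) is the surviving orchestrator, admin port 8091
# - Couchbase Server on node-1 (ending .124) has been restarted cleanly
# - Administrator/password are placeholder credentials

# Add node-1 back to the cluster as a data-service node
couchbase-cli server-add -c XXX.XXX.XXX.125:8091 \
  -u Administrator -p 'password' \
  --server-add XXX.XXX.XXX.124:8091 \
  --server-add-username Administrator \
  --server-add-password 'password' \
  --services data

# Rebalance so the replica copies held on node-2 repopulate node-1
couchbase-cli rebalance -c XXX.XXX.XXX.125:8091 \
  -u Administrator -p 'password'
```

If we’ve misunderstood and node-1 can still be recovered in place rather than re-added, corrections are very welcome.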
I think it is important to mention that the Web UI on node-1 is no longer accessible.
Any help is more than welcome.