Couchbase cluster reports frequent nodes down/up

leon.chadwick · February 21, 2017, 10:09am

We have a problem with a 3-node cluster running Couchbase Community 4.1.1.
A second such cluster running the same configuration does not exhibit this issue.

The cluster regularly reports nodes losing contact with each other and then, in the same second, reports that they are up again. This happens over and over, roughly every 2 to 6 minutes.
The third node of the cluster isn’t mentioned in the logs.
It is suspicious that the recovery happens immediately after the loss of connectivity.
Is there a way to determine if this is some underlying network issue or some problem with Couchbase clustering?

I had to increase the auto-failover period to 90 seconds, as trying 60 or less just caused the cluster to get in a mess due to this problem. We want to run this cluster with a much shorter failover period (e.g. 30 seconds), so this is a nuisance.

This is a copy from the admin web page logs showing 2 incidents of the problem I refer to:

Node ‘ns_1@serverA.nyk.mycompany.com’ saw that node ‘ns_1@serverB.nyk.mycompany.com’ came up. Tags: [] ns_node_disco004 ns_1@serverA.nyk.mycompany.com 04:42:28 - Tue Feb 21, 2017
Node ‘ns_1@serverB.nyk.mycompany.com’ saw that node ‘ns_1@serverA.nyk.mycompany.com’ came up. Tags: [] ns_node_disco004 ns_1@serverB.nyk.mycompany.com 04:42:28 - Tue Feb 21, 2017
Node ‘ns_1@serverA.nyk.mycompany.com’ saw that node ‘ns_1@serverB.nyk.mycompany.com’ went down. Details: [{nodedown_reason, connection_closed}] ns_node_disco005 ns_1@serverA.nyk.mycompany.com 04:42:28 - Tue Feb 21, 2017
Node ‘ns_1@serverB.nyk.mycompany.com’ saw that node ‘ns_1@serverA.nyk.mycompany.com’ went down. Details: [{nodedown_reason, net_tick_timeout}] ns_node_disco005 ns_1@serverB.nyk.mycompany.com 04:42:28 - Tue Feb 21, 2017

Node ‘ns_1@serverA.nyk.mycompany.com’ saw that node ‘ns_1@serverB.nyk.mycompany.com’ came up. Tags: [] ns_node_disco004 ns_1@serverA.nyk.mycompany.com 04:36:43 - Tue Feb 21, 2017
Node ‘ns_1@serverB.nyk.mycompany.com’ saw that node ‘ns_1@serverA.nyk.mycompany.com’ came up. Tags: [] ns_node_disco004 ns_1@serverB.nyk.mycompany.com 04:36:43 - Tue Feb 21, 2017
Node ‘ns_1@serverA.nyk.mycompany.com’ saw that node ‘ns_1@serverB.nyk.mycompany.com’ went down. Details: [{nodedown_reason, connection_closed}] ns_node_disco005 ns_1@serverA.nyk.mycompany.com 04:36:43 - Tue Feb 21, 2017
Node ‘ns_1@serverB.nyk.mycompany.com’ saw that node ‘ns_1@serverA.nyk.mycompany.com’ went down. Details: [{nodedown_reason, net_tick_timeout}] ns_node_disco005 ns_1@serverB.nyk.mycompany.com 04:36:43 - Tue Feb 21, 2017
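
For anyone wanting to tally entries like these, here is a rough Python sketch that counts the "went down" events per observer, peer and reason, and measures the gaps between incidents. It only assumes the text format pasted above (copied from the admin web page); the names `parse_events`, `summarize` and `SAMPLE` are made up for illustration and are not part of any Couchbase tooling.

```python
import re
from collections import Counter
from datetime import datetime

# Matches one admin-UI log entry in the format pasted in this thread.
# Whitespace (including a line break inside "Details: [...]") is tolerated.
EVENT_RE = re.compile(
    r"Node\s+[‘'](?P<observer>[^’']+)[’']\s+saw that node\s+"
    r"[‘'](?P<peer>[^’']+)[’']\s+(?P<kind>came up|went down)\."
    r"(?:\s+Details:\s+\[\{nodedown_reason,\s*(?P<reason>\w+)\}\])?"
    r".*?(?P<time>\d{2}:\d{2}:\d{2})\s+-\s+(?P<date>\w{3} \w{3} \d{1,2}, \d{4})",
    re.DOTALL,
)


def parse_events(text):
    """Return a list of dicts, one per log entry found in the pasted text."""
    events = []
    for m in EVENT_RE.finditer(text):
        when = datetime.strptime(
            f"{m['time']} {m['date']}", "%H:%M:%S %a %b %d, %Y"
        )
        events.append({
            "observer": m["observer"],
            "peer": m["peer"],
            "kind": m["kind"],
            "reason": m["reason"],  # None for "came up" entries
            "when": when,
        })
    return events


def summarize(events):
    """Count 'went down' events and compute seconds between incidents."""
    downs = [e for e in events if e["kind"] == "went down"]
    reasons = Counter((e["observer"], e["peer"], e["reason"]) for e in downs)
    times = sorted({e["when"] for e in downs})
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return reasons, gaps


# Three "went down" entries copied from the log excerpt above.
SAMPLE = """
Node ‘ns_1@serverA.nyk.mycompany.com’ saw that node ‘ns_1@serverB.nyk.mycompany.com’ went down. Details: [{nodedown_reason,
connection_closed}] ns_node_disco005 ns_1@serverA.nyk.mycompany.com 04:42:28 - Tue Feb 21, 2017
Node ‘ns_1@serverB.nyk.mycompany.com’ saw that node ‘ns_1@serverA.nyk.mycompany.com’ went down. Details: [{nodedown_reason,
net_tick_timeout}] ns_node_disco005 ns_1@serverB.nyk.mycompany.com 04:42:28 - Tue Feb 21, 2017
Node ‘ns_1@serverB.nyk.mycompany.com’ saw that node ‘ns_1@serverA.nyk.mycompany.com’ went down. Details: [{nodedown_reason,
net_tick_timeout}] ns_node_disco005 ns_1@serverB.nyk.mycompany.com 04:36:43 - Tue Feb 21, 2017
"""

if __name__ == "__main__":
    reasons, gaps = summarize(parse_events(SAMPLE))
    for (observer, peer, reason), count in reasons.items():
        print(f"{observer} lost {peer}: {reason} x{count}")
    print("seconds between incidents:", gaps)
```

If the same node is always the one reporting `net_tick_timeout`, that at least narrows down which side to look at first.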

leon.chadwick · February 28, 2017, 11:27am

Couchbase clustering logic has to be really bad, or really broken to be given a hugely generous 60 seconds timeout window and to still ALWAYS be failing over when the underlying servers are running with no issues.
What is wrong with it?

WillGardella · February 28, 2017, 6:20pm

Hi Leon,
Sorry you didn’t get a response earlier. What you’re seeing isn’t normal. You might check to see if a service on serverB is crashing and restarting. It’s also possible that one or more of your servers are overloaded and being slow to respond. It’s also possible that there’s something going wrong in your network config, but I would start by looking carefully at the logs on serverB.
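
To help separate a network or load problem from a Couchbase one, a simple connectivity probe run on one node against another can be useful. Below is a minimal Python sketch, not an official tool: `probe_once` and `probe_loop` are hypothetical helper names, and port 8091 (the Couchbase admin/REST port) is only an assumed target. A clean result on 8091 does not prove that the ports used for node-to-node cluster traffic are healthy, so treat it as a first check only.

```python
import socket
import time


def probe_once(host, port, timeout=2.0):
    """Return the TCP connect latency in seconds, or None if it fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and DNS lookup failures.
        return None


def probe_loop(host, port, count=10, interval=1.0, timeout=2.0):
    """Probe repeatedly; return a list of (HH:MM:SS, latency or None)."""
    results = []
    for i in range(count):
        latency = probe_once(host, port, timeout)
        results.append((time.strftime("%H:%M:%S"), latency))
        if i + 1 < count:
            time.sleep(interval)
    return results


# Example (assumed host and port), run on serverB against serverA:
#
#   for stamp, latency in probe_loop("serverA.nyk.mycompany.com", 8091,
#                                    count=600, interval=1.0):
#       print(stamp, "FAILED" if latency is None else f"{latency * 1000:.1f} ms")
```

Failed or very slow probes whose timestamps line up with the "went down" entries would point toward the network or an overloaded host. If the probes stay clean while the log entries keep appearing, the server-side logs are the better place to dig.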
-Will
