I’m attempting to integrate this with our monitoring system, but I’m having difficulty understanding what triggers a NodeDisconnectedEvent and how, so that I can test this type of notification.
I’m running Couchbase 4.1.0-5005 Enterprise Edition (build-5005). I have tested with Java SDK 2.2.3 and 2.2.6.
I’m running a 3-node cluster. After bringing up my environment, I see three NodeConnectedEvents on my event bus.
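For context, a minimal sketch of how I’m watching the event bus (node addresses and the bucket name are placeholders) looks roughly like this:

import com.couchbase.client.core.event.CouchbaseEvent;
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

public class EventBusWatcher {
    public static void main(String[] args) {
        CouchbaseEnvironment env = DefaultCouchbaseEnvironment.create();

        // Print every event the SDK publishes on its event bus, including
        // NodeConnectedEvent and NodeDisconnectedEvent.
        env.eventBus().get().subscribe(event -> System.out.println("CouchbaseEvent: " + event));

        // Placeholder node addresses and bucket name.
        CouchbaseCluster cluster = CouchbaseCluster.create(env, "node1", "node2", "node3");
        Bucket bucket = cluster.openBucket("default");
    }
}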
I then block access to one of the nodes by dropping all packets to and from it on the client machine:
iptables -A INPUT -s IP_HERE -j DROP
iptables -A OUTPUT -d IP_HERE -j DROP
While I’m using the SDK, every third request times out. I don’t see a NodeDisconnectedEvent until about 20-25 minutes later.
Dropping packets is a bit different from terminating the connection. Is there a regular workload? The way we’ve approached this is that once we see a continuous run of timeouts to a given node (tunable by a threshold), we drop the connection and attempt to rebuild it at the client.
Normally this will happen within seconds or minutes, but it could take as long as 20-25 minutes if there isn’t any workload.
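Purely as an illustration of that approach (the threshold value and the rebuild hook are placeholders, not SDK features), the client-side logic is roughly:

import java.util.concurrent.atomic.AtomicInteger;

// Illustration only: count consecutive timeouts to a node and rebuild the
// connection once a threshold is crossed.
public class TimeoutTracker {
    private final int threshold;
    private final Runnable rebuildConnection; // e.g. close and reopen the bucket
    private final AtomicInteger consecutiveTimeouts = new AtomicInteger();

    public TimeoutTracker(int threshold, Runnable rebuildConnection) {
        this.threshold = threshold;
        this.rebuildConnection = rebuildConnection;
    }

    public void onSuccess() {
        consecutiveTimeouts.set(0); // any successful response resets the streak
    }

    public void onTimeout() {
        if (consecutiveTimeouts.incrementAndGet() >= threshold) {
            consecutiveTimeouts.set(0);
            rebuildConnection.run();
        }
    }
}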
This is a great test by the way. We do something like this regularly.
One way you can probably simulate a NodeDisconnectedEvent is to kill the memcached process on one of the nodes. That will terminate the TCP connections (sending TCP FINs), and the client would then have to rebuild them.
Thanks for the information! I’m not running a regular workload, but I am running some ad-hoc N1QL queries against the cluster after enabling the iptables firewall rules to block one of the nodes.
I’m trying to simulate the connection being broken from the perspective of the client SDK, for example a firewall cutting a stale TCP connection, but without isolating the node from the rest of the cluster, so I’m reluctant to kill the memcached process on the node.
The NodeDisconnectedEvent is triggered at the same time you’d see a node disconnect in the logs, that is, when the node’s internal state transitions from CONNECTED to DISCONNECTED. Most commonly this happens when all sockets to a node go down (shutdown, failover, rebalance out).
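If you want to feed that into your monitoring, a minimal sketch (the alerting line is just a placeholder) is to filter the event bus for that event type:

import com.couchbase.client.core.event.system.NodeDisconnectedEvent;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

public class DisconnectAlert {
    public static void main(String[] args) {
        CouchbaseEnvironment env = DefaultCouchbaseEnvironment.create();

        env.eventBus().get()
            .filter(event -> event instanceof NodeDisconnectedEvent)
            .subscribe(event -> {
                // Placeholder: replace with a call into your monitoring/alerting system.
                System.err.println("Node went DISCONNECTED: " + event);
            });

        // ... create the cluster with this environment and run your workload ...
    }
}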
We don’t do TCP-level keepalive; there is an application-level keepalive (sending various messages over the app protocol in idle states), but it has no direct effect on this.
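For reference, a sketch of where that app-level keepalive interval is tuned, assuming the keepAliveInterval setting on the environment builder in this SDK version (the 10-second value is arbitrary, and again this doesn’t change when the event fires):

import java.util.concurrent.TimeUnit;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

public class KeepAliveConfig {
    public static void main(String[] args) {
        // keepAliveInterval is given in milliseconds.
        CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
                .keepAliveInterval(TimeUnit.SECONDS.toMillis(10))
                .build();
        System.out.println("keepAliveInterval = " + env.keepAliveInterval() + "ms");
    }
}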
Make sure not to just cut one stale TCP connection, but rather perform actions that actually remove the node, like a failover or a rebalance out. If you only cut one TCP socket, the client will try to reconnect (since the node is still part of the server config) and you won’t see the event!
Yes, that aligns with what I am seeing. I was thinking the client would trigger this event on a broken TCP connection, but that is not the case. I see the event when stopping the Couchbase service, on failover, and on node removal.