We recently upgraded our clusters in our production environment. As I understand the process, one node (the one to be upgraded) is marked for failover, an upgraded node is introduced into the cluster, and rebalancing occurs. Once rebalancing is complete, the failed-over node is removed and/or upgraded.
The behavior we saw during this upgrade was that, while rebalancing was in progress, requests for documents were still being sent to the failed-over node. This resulted in a large number of document fetch timeouts, which surface as errors in our application, and the problem persisted until we restarted the application. If we restarted the application during rebalancing everything was fine, until the next node was marked for failover and another node was introduced to the cluster, triggering another rebalance. At that point document requests (for documents the master thought should be on the failed-over node) were again sent to the failed-over node, resulting in timeout errors.
Are there any known issues with the behavior of the client during rebalance events?
We upgraded from 3.0.3 to 4.1.1; the client is 2.1.4.
When performing a routine upgrade in a production environment you should not ‘Fail Over’ the node you intend to replace; instead, use ‘Remove Server’. You can find more information on this in the documentation here: http://docs.couchbase.com/admin/admin/Install/upgrade-online.html, under the section titled ‘With an extra node available’. This process is known as a ‘swap rebalance’ and is much more efficient than failing over a node and then adding a new one.
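If you manage your clusters outside the web console, the same swap rebalance can also be driven through the cluster REST API (add the new node, then rebalance with the old node ejected in the same pass). The sketch below is illustrative only: the hostnames, credentials and otpNode ids are placeholders, and you should check the REST API reference for your server version before relying on the exact parameter names.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Base64;

/**
 * Illustrative sketch: performs a swap rebalance via the Couchbase cluster
 * REST API instead of the web console. All hostnames, credentials and the
 * otpNode ids (e.g. "ns_1@old-node.example.com") are placeholders.
 */
public class SwapRebalance {

    private static void post(String path, String body) throws Exception {
        URL url = new URL("http://cluster-node.example.com:8091" + path);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        String auth = Base64.getEncoder()
                .encodeToString("Administrator:password".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes("UTF-8"));
        }
        System.out.println(path + " -> HTTP " + conn.getResponseCode());
    }

    public static void main(String[] args) throws Exception {
        // 1. Add the new, already-upgraded node to the cluster.
        post("/controller/addNode",
                "hostname=" + URLEncoder.encode("new-node.example.com", "UTF-8")
                        + "&user=Administrator&password=password");

        // 2. Rebalance, ejecting the old node in the same pass (swap rebalance).
        //    knownNodes lists every node's otpNode id; ejectedNodes lists the one leaving.
        post("/controller/rebalance",
                "knownNodes=" + URLEncoder.encode(
                        "ns_1@cluster-node.example.com,ns_1@new-node.example.com,ns_1@old-node.example.com",
                        "UTF-8")
                        + "&ejectedNodes=" + URLEncoder.encode("ns_1@old-node.example.com", "UTF-8"));
    }
}
```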
There are a few issues where a node can technically be failed over and clients will not notice. I seem to recall one of those being caused by old versions of various clients holding stale vbucket maps, although I would expect not-my-vbucket errors rather than timeouts. It’s definitely worth updating your client, as, irrespective of which SDK you’re using, you’re a few versions behind.
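In the meantime, one application-side mitigation is to retry document fetches a few times before treating a timeout as a hard error, so that a briefly stale vbucket map during a rebalance shows up as a short delay rather than an application failure. This is only a minimal sketch, assuming the 2.x Java SDK (the thread only says “client 2.1.4”); the host, bucket name and tuning values are placeholders.

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;

import java.util.concurrent.TimeUnit;

/**
 * Sketch only: wraps a blocking get in a small bounded retry loop so that
 * transient timeouts during a rebalance are retried instead of immediately
 * becoming application errors. Host, bucket name and timeouts are placeholders.
 */
public class RetryingFetch {

    private final Bucket bucket;

    public RetryingFetch(Bucket bucket) {
        this.bucket = bucket;
    }

    public JsonDocument getWithRetry(String id, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Per-operation timeout; during a rebalance the client's view of
                // the cluster may briefly be stale, so waiting and retrying
                // usually recovers without an application-level error.
                return bucket.get(id, 2, TimeUnit.SECONDS);
            } catch (RuntimeException e) {
                last = e;
                try {
                    Thread.sleep(100L * attempt); // simple linear backoff
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw last;
                }
            }
        }
        if (last != null) {
            throw last;
        }
        throw new IllegalArgumentException("maxAttempts must be >= 1");
    }

    public static void main(String[] args) {
        Bucket bucket = CouchbaseCluster.create("cluster-node.example.com")
                .openBucket("example-bucket");
        JsonDocument doc = new RetryingFetch(bucket).getWithRetry("some-doc-id", 5);
        System.out.println(doc);
    }
}
```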
I double-checked with our DBAs, and it turns out I described the process incorrectly. They did a swap rebalance per the link you provided. Given that, are there any known issues with the behavior of the client during a swap rebalance?