We recently upgraded our clusters in our production environment. As I understand the process, one node (the one to be upgraded) is marked for failover, an upgraded node is introduced into the cluster, and rebalancing occurs. Once rebalancing is complete, the failed-over node is removed and/or upgraded.
The behavior we saw during this upgrade was that, while rebalancing was in progress, requests for documents were still being sent to the failed-over node. This resulted in a large number of document fetch timeouts, which surface as errors in our application, and the problem persisted until we restarted the application. If we restarted the application during rebalancing everything was fine, until the next node was marked for failover and another node was introduced to the cluster, triggering another rebalance. At that point document requests (for documents the master thought should be on the failed-over node) were again sent to the failed-over node, resulting in timeout errors.
Are there any known issues with the behavior of the client during rebalance events?
We upgraded from 3.0.3 to 4.1.1; the client is 2.1.4.
When performing a routine upgrade in a production environment you should not ‘Fail Over’ the node you intend to replace; instead, use ‘Remove Server’. You can find more information on this in the documentation here: http://docs.couchbase.com/admin/admin/Install/upgrade-online.html, under the section titled ‘With an extra node available’. This process is known as a ‘swap rebalance’ and is much more efficient than failing over a node and then adding a new one.
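If you manage your clusters outside the web console, the same swap rebalance can also be driven through the cluster REST API (add the new node, then rebalance with the old node ejected in the same pass). The sketch below is illustrative only: the hostnames, credentials and otpNode ids are placeholders, and you should check the REST API reference for your server version before relying on the exact parameter names.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Base64;

/**
 * Illustrative sketch: performs a swap rebalance via the Couchbase cluster
 * REST API instead of the web console. All hostnames, credentials and the
 * otpNode ids (e.g. "ns_1@old-node.example.com") are placeholders.
 */
public class SwapRebalance {

    private static void post(String path, String body) throws Exception {
        URL url = new URL("http://cluster-node.example.com:8091" + path);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        String auth = Base64.getEncoder()
                .encodeToString("Administrator:password".getBytes("UTF-8"));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        conn.setDoOutput(true);
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes("UTF-8"));
        }
        System.out.println(path + " -> HTTP " + conn.getResponseCode());
    }

    public static void main(String[] args) throws Exception {
        // 1. Add the new, already-upgraded node to the cluster.
        post("/controller/addNode",
                "hostname=" + URLEncoder.encode("new-node.example.com", "UTF-8")
                        + "&user=Administrator&password=password");

        // 2. Rebalance, ejecting the old node in the same pass (swap rebalance).
        //    knownNodes lists every node's otpNode id; ejectedNodes lists the one leaving.
        post("/controller/rebalance",
                "knownNodes=" + URLEncoder.encode(
                        "ns_1@cluster-node.example.com,ns_1@new-node.example.com,ns_1@old-node.example.com",
                        "UTF-8")
                        + "&ejectedNodes=" + URLEncoder.encode("ns_1@old-node.example.com", "UTF-8"));
    }
}
```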
There are a few issues where a node can technically be failed over and clients will not notice. I seem to recall one of those being caused by old versions of various clients holding stale vbucket maps, although I would expect not-my-vbucket errors rather than timeouts. It’s definitely worth updating your client, as, irrespective of which SDK you’re using, you’re a few versions behind.
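In the meantime, one application-side mitigation is to retry document fetches a few times before treating a timeout as a hard error, so that a briefly stale vbucket map during a rebalance shows up as a short delay rather than an application failure. This is only a minimal sketch, assuming the 2.x Java SDK (the thread only says “client 2.1.4”); the host, bucket name and tuning values are placeholders.

```java
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.document.JsonDocument;

import java.util.concurrent.TimeUnit;

/**
 * Sketch only: wraps a blocking get in a small bounded retry loop so that
 * transient timeouts during a rebalance are retried instead of immediately
 * becoming application errors. Host, bucket name and timeouts are placeholders.
 */
public class RetryingFetch {

    private final Bucket bucket;

    public RetryingFetch(Bucket bucket) {
        this.bucket = bucket;
    }

    public JsonDocument getWithRetry(String id, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Per-operation timeout; during a rebalance the client's view of
                // the cluster may briefly be stale, so waiting and retrying
                // usually recovers without an application-level error.
                return bucket.get(id, 2, TimeUnit.SECONDS);
            } catch (RuntimeException e) {
                last = e;
                try {
                    Thread.sleep(100L * attempt); // simple linear backoff
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw last;
                }
            }
        }
        if (last != null) {
            throw last;
        }
        throw new IllegalArgumentException("maxAttempts must be >= 1");
    }

    public static void main(String[] args) {
        Bucket bucket = CouchbaseCluster.create("cluster-node.example.com")
                .openBucket("example-bucket");
        JsonDocument doc = new RetryingFetch(bucket).getWithRetry("some-doc-id", 5);
        System.out.println(doc);
    }
}
```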
I double-checked with our DBAs, and it turns out I described the process incorrectly. They did a swap rebalance per the link you provided. Given that, are there any known issues with the behavior of the client during a swap rebalance?