Hi,
After we add a new node to a cluster, then perform rebalancing our applications start throwing Timeouts exceptions with RetryReasons: KvNotMyVBucket
. Note: In my local cluster working in Docker I was able to reproduce it only 2 times, but it’s always reproduced in remote environments.
Cluster and SDK information:
- SDK version: 3.3.2
- Cluster version: 7.1.2 build 3454
- There are 4 nodes in the cluster (3 in my local env)
- Every bucket configured with 2 replicas
According to the logs from my local env, a new node can’t be initialized because SelectBucket
operations fails with KeyNotFound
error:
15:47:17 dbug: Couchbase.Core.ClusterNode[104] Op failed: Couchbase.Core.IO.Operations.Authentication.SelectBucket Couchbase.Core.Exceptions.KeyValue.DocumentNotFoundException: Exception of type ‘Couchbase.Core.Exceptions.KeyValue.DocumentNotFoundException’ was thrown. at Couchbase.Core.ClusterNode.ExecuteOp(Func`4 sender, IOperation op, Object state, CancellationTokenPair tokenPair)
-----------------------Context Info---------------------------
{
“dispatchedFrom”:“e0d74601479e”,
“dispatchedTo”:“172.16.101.14:11207”,
“documentKey”:“travel-sample-magma”,
“clientContextId”:“705270”,
“cas”:0,
“status”:“keyNotFound”,
“bucketName”:“travel-sample-magma”,
“collectionName”:null,
“scopeName”:null,
“message”:“KV Error: {Name=\u0022KEY_ENOENT\u0022, Description=\u0022Not Found\u0022, Attributes=\u0022item-only\u0022}”,“opCode”:“selectBucket”,“retryReasons”:
}
The most interesting thing happens next after the error. The SDK tries to perform a connection reinitialization:
15:47:17 dbug: Couchbase.Core.ClusterNode[0] Starting connection reinitialization on server 172.16.101.14:11207.
15:47:17 dbug: Couchbase.Core.ClusterNode[100] Executing op Helo on 172.16.101.14:11207 with key {“i”:“bf9049a231a53c4c/3f5737c6d5a269d2”,“a”:“couchbase-net-sdk/3.3.2.0 (clr/.NET 6.0.13) (os/Linux 5.10.102.1-microsoft-standard-WSL2 #1 SMP Wed Mar 2 00:30:59 UTC 2022)”} and opaque 705376.
And after this I can’t find anything related to this opaque 705376
, probably meaning that we never receive a response from the cluster. And seems that a cancellation token isn’t passed to the reinitialization flow in the ClusterNode
:
public async Task HelloHello()
{
foreach (var connection in ConnectionPool.GetConnections())
{
...
using var rootSpan = RootSpan("reinitialize_connection");
var serverFeatureList = await Hello(connection, rootSpan).ConfigureAwait(false);
...
}
}
After this I don’t see records related to the cluster nodes updates like this:
15:47:17 dbug: Couchbase.CouchbaseBucket[7000] Updating cluster nodes for rev#9090 on travel-sample-magma
And seems like I ended up in the situation when cluster vbucket map indicates that some vbuckets are on the new node, but this node isn’t the cluster node list, so SDK takes a random node and sends a request to it which definitely responses with NotMyVBucket error. After 6 attempts to retry the operation fails.
I’m going to test the same with the latest SDK version, but that would be great to know what might cause this issue in 3.3.2 and if there are any workarounds.
The link to the archive with two logs files when I was able to reproduce the issue: