I have a multi-threaded process that does batch processing of documents in Couchbase with large numbers of bucket.Get calls, and once in a while during high intensity operations Couchbase would stop responding to the client during a certain batch of Get calls. This would last until the SendTimeout value of the PoolConfiguration is exceeded, and then return a batch of OperationTimeout responses. What’s weird is afterwards the remaining Get calls would still fail, with a combination of OperationTimeouts and ClientFailures, where the ClientFailures have an error message along the lines of “but got: { partial contents of the JSON document }”.
I have looked into different connection pool configurations, and have set up multiplexing IO as it suggests to help with high numbers of concurrent connections (which seemed to help for a while?), but eventually this problem started happening again.
Does anyone have any idea where I should look into to solve this issue? Thanks!
Example code on how you’re performing the batch get requests
Example number of parallel requests with average document size
If you’re using a custom Parallel.ForEach loop with custom ParallelOptions overriding the MaxDegreeOfParallelism, for example, it can cause threads to block to a point where there aren’t resources to process a response and the requests will eventually timeout.
Also, if you have the opportunity, the Async API (eg GetAsync) is more performant & reliable when dealing with many parallel requests because is based on Tasks which are better at managing resources.
Hi @MikeGoldsmith,
Our client configuration is as follows, where the servers are 3 EC2 instances:
public static readonly ClientConfiguration COUCHBASE_CLUSTER_CONFIG = new ClientConfiguration()
{
Servers = COUCHBASE_HOSTNAMES,
UseSsl = false,
DefaultOperationLifespan = 5000,
Serializer = () =>
{
var defaultSerializer = new JsonSerializerSettings() {ContractResolver = new DefaultContractResolver()};
return new DefaultSerializer(defaultSerializer, defaultSerializer);
},
BucketConfigs = new Dictionary<string, BucketConfiguration>
{
{
BUCKET_NAME, new BucketConfiguration
{
BucketName = BUCKET_NAME,
Password = "",
PoolConfiguration = new PoolConfiguration
{
SendTimeout = 15000
}
}
}
},
ConnectionPoolCreator = ConnectionPoolFactory.GetFactory<ConnectionPool<MultiplexingConnection>>(),
IOServiceCreator = IOServiceFactory.GetFactory<MultiplexingIOService>()
};
The Couchbase server is running 4.6.1 with the .NET SDK on 2.4.6.
Documents are simply being retrieved using the bucket.Get<T>(List<string>) function with no custom ParallelOptions used, and there could be around 20 threads each requesting in batches of up to 1000 documents each (inside while loops, with document IDs being supplied from Elasticsearch). Each document should be a few kilobytes in size on average.
I will try using GetAsync and see if that helps. Thanks!
Please can you upgrade the client, there are known issues regarding managing TCP connections on 2.4.6 which led to it being delisted on nuget.org. 2.4.8 is the latest release.
Is there a reason why you’re manually setting the ConnectionPoolCreator and IOServiceCreator to use Multiplexing? From 2.4.0 we defaulted to using Multiplexing connections, and from 2.4.5 we introduced the SharedConnectionPool to allow multiple connections per server. If you remove your config entries for both ConnectionPoolCreator and IOServiceCreator you will get those as default.
There’s no reason why I was manually setting the ConnectionPoolCreator, I must’ve followed instructions for older versions of the client when changing from the old connection pool method to multiplexing. I’ve removed those now, and also changed our code to use GetAsync instead of Get when performing bulk document retrievals.
Been doing some testing using only the GetAsync method for retrieving documents, and now there’s a different issue. During previously mentioned heavy loads the operations would now sometimes return with the status TransportFailure and the message The operation has timed out. This is then followed by a period of nothing but NodeUnavailable responses, where it complains that the first node in our list of 3 servers is down or unreachable. As far as I can tell on Couchbase from both the web interface and its logs, nothing abnormal happens on the Couchbase servers during this time.
What might be causing this and how come it only complains that the first node is unavailable when our Couchbase cluster is made up of 3 nodes? Do I have to implement the logic to retry the retrieval myself using something like bucket.GetFromReplicaAsync?
A NodeUnavailableException might happen if a large number (10) of IO related errors occurred within a threshold (500ms) - an example might be a when a TCP RST occurs (like when you try to download a webpage and the response is connection-reset). These are generally network related errors and likely temporary or sporadic (hence the TransportFailure status). The client should try to reconnect after brief sleep time (1000ms) and continue working as expected.
Yes, you can definitely fall back to a replica-read in this case. You should also enable logging and try to determine what is causing the TransportFailure to happen.
Thanks for that, I’m able to mostly mitigate this problem after implementing client retry procedures (at 1000ms intervals). I’m still curious as to why the failures are happening though, and whether there are anything we can do to reduce their occurrence. Both the client and the Couchbase cluster are hosted on AWS within the same region, so network connection and speeds shouldn’t be an issue.
Another note is that I sometimes get the response status of None during the periods of error, which I’ve also included as a status that warrants retry.
I’ve enabled logging and can see the following during the periods of which errors occur:
2017-09-04 10:07:30,727 [129] WARN Couchbase.IO.ConnectionBase - Handling disconnect for connection 72322f00-8a4a-488d-9924-d351daac6a3f: System.IndexOutOfRangeException: Index was outside the bounds of the array.
at Couchbase.IO.Converters.DefaultConverter.CopyAndReverse(Byte[] src, Int32 offset, Int32 length)
at Couchbase.IO.Converters.DefaultConverter.ToInt32(Byte[] buffer, Int32 offset)
at Couchbase.IO.MultiplexingConnection.ParseReceivedData()
at Couchbase.IO.MultiplexingConnection.ReceiveThreadBody()
2017-09-04 10:07:30,758 [42] WARN Couchbase.IO.ConnectionBase - Handling disconnect for connection 72322f00-8a4a-488d-9924-d351daac6a3f: System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'System.Net.Sockets.Socket'.
at System.Net.Sockets.Socket.Send(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags, SocketError& errorCode)
at System.Net.Sockets.Socket.Send(Byte[] buffer, Int32 offset, Int32 size, SocketFlags socketFlags)
at Couchbase.IO.MultiplexingConnection.SendAsync(Byte[] request, Func`2 callback)
I have the logging level on WARN at the moment as I couldn’t see anything useful under DEBUG during brief testing, but the DEBUG level logs were huge so I might’ve missed something.
The exception you posted may be a bit misleading, since its likely the effect of an earlier exception. Note that the client itself does retry on certain cases, for example if the operation is not a mutation or if a CAS is used. If you want, you can send me a link to the DEBUG logs and I can investigate further.
@jmorris
Did this go investigated further, if so what was the outcome? I have a very similar issue at the moment where high loads result in timeouts. I consistently get it while loading the client directly from an unit test but also sometimes when loading an webapi that fetches products. I’m wondering if it is fixable or if you could recommend a strategy for handling it.
My test code:
[Fact]
public async Task TryToFailCouchbase()
{
var bucket = _state.Bucket;
var getBatch = new ActionBlock<List<ShortSku>>(async batch =>
{
var tasks = new Task<IDocumentResult<CbCatalogProduct>>[batch.Count];
var tasks2 = new Task<IDocumentResult<CbProductStock>>[batch.Count];
var tasks3 = new Task<IDocumentResult<CbProductChannels>>[batch.Count];
for (int i = 0; i < batch.Count; i++)
{
tasks[i] = bucket.GetDocumentAsync<CbCatalogProduct>(CbCatalogProduct.GetId(batch[i]));
tasks2[i] = bucket.GetDocumentAsync<CbProductStock>(CbProductStock.GetId(batch[i]));
tasks3[i] = bucket.GetDocumentAsync<CbProductChannels>(CbProductChannels.GetId(batch[i]));
}
var ret = await Task.WhenAll(tasks).ConfigureAwait(false);
var ret2 = await Task.WhenAll(tasks2).ConfigureAwait(false);
var ret3 = await Task.WhenAll(tasks3).ConfigureAwait(false);
if (!ret.All(r => r.Success))
throw ret.First(r => !r.Success).Exception;
},
new ExecutionDataflowBlockOptions
{
BoundedCapacity = 50,
MaxDegreeOfParallelism = 8,
}
);
var batchMaker = new KeyBatches();
while (true)
{
await getBatch.SendAsync(batchMaker.MakeFull()).ConfigureAwait(false);
if (getBatch.Completion.IsFaulted)
{
throw getBatch.Completion.Exception;
}
}
}
Do we really need to nest async within async? I would suggest not using any of the Dataflow API components with the SDK. The SDK is highly IO bound, once you start specifying MaxDegreeOfParalleism we are using concepts related to CPU bound operations (crunching numbers, etc). You can easily batch without the overhead of the Dataflow API.
We are executing all of this in a loop that only completes on failure? Is that a realistic use-case?
Are you using the default configuration or tuning it for the expected work load? Although, given the non-terminating loop, you’ll bottleneck somewhere - app memory, hardware, network.
I’m very much new to this, so I might be approaching this from the wrong angle. What I would like to achieve is some maner of control/monitoring over what happens when we can’t handle the load.
What I’m trying to do with the test is to recreate the error that I get in my webApi when stressing it. In that scenario I have a AspNetCore Webapi with default settings couchbase that does a similar fetch of documents on a per request basis. When loading this with sequential requests it eventually fails in the same way for me. So I’m concerned about what I can do about it, should I just test to find our limits and monitor our traffic trends to be able to act when the need grows?