We see the errors below in our logs for roughly 3% of our calls to Couchbase. The errors appear periodically, generally within 10-15 seconds of each other. They can go on for several minutes to an hour, then nothing for many hours. It seems network-related, but our network team can't find anything. TCP keep-alive is set to 60 seconds on the Windows Server. What else can I provide to help troubleshoot?
```
Couchbase.Core.NodeUnavailableException: The node 192.168.126.82:11210 that the key was mapped to is either down or unreachable. The SDK will continue to try to connect every 1000ms. Until it can connect every operation routed to it will fail with this exception. for CacheKey : PromotionData_PromotionNumber:block
   at Fanatics.Core.Caching.Couchbase.CouchBaseProvider.HandleFailedResponse(String cacheKey, IResult couchbaseResult, ResponseStatus status) in g:\Jenkins\workspace\Core.Caching\Fanatics.Core.Caching.Couchbase\CouchBaseProvider.cs:line 130
   at Fanatics.Core.Caching.Couchbase.CouchBaseProvider.Put(String cacheKey, String clusterName, String bucketName, Object cachedItem, TimeSpan lifetime) in g:\Jenkins\workspace\Core.Caching\Fanatics.Core.Caching.Couchbase\CouchBaseProvider.cs:line 943
   at Fanatics.Core.Caching.JsonCacheClientSlim.Put[T](Object parameters, T item) in g:\Jenkins\workspace\Core.Caching\Fanatics.Core.Caching\JsonCacheClientSlim.cs:line 696
   at Promo.Services.Controllers.CacheController.ReadAndUpdateCache(String promotionNumber, String clientName) in g:\Jenkins\workspace\Promo\PromoTools\Promo.Services\Controllers\cache\CacheController.cs:line 100
```

```
System.IO.IOException: The connection has timed out while an operation was in flight. The default is 15000ms.
   at Couchbase.IO.Connection.Send(Byte[] buffer)
   at Couchbase.IO.Strategies.DefaultIOStrategy.Execute[T](IOperation`1 operation)
for CacheKey : PromotionData_PromotionNumber:ticktock
   at Promo.Services.Controllers.CacheController.ReadCache(String promotionNumber, String clientName) in g:\Jenkins\workspace\Promo\PromoTools\Promo.Services\Controllers\cache\CacheController.cs:line 69
   at Promo.Services.Controllers.PromotionDetailController.Get(String promotionNumber, String clientName) in g:\Jenkins\workspace\Promo\PromoTools\Promo.Services\Controllers\public\PromotionDetailController.cs:line 40
```

```
System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host
   at Couchbase.IO.Connection.Send(Byte[] buffer)
   at Couchbase.IO.Strategies.DefaultIOStrategy.Execute[T](IOperation`1 operation)
for CacheKey : PromotionData_PromotionNumber:pannant
   at Promo.Services.Controllers.CacheController.ReadCache(String promotionNumber, String clientName) in g:\Jenkins\workspace\Promo\PromoTools\Promo.Services\Controllers\cache\CacheController.cs:line 69
   at Promo.Services.Controllers.PromotionDetailController.Get(String promotionNumber, String clientName) in g:\Jenkins\workspace\Promo\PromoTools\Promo.Services\Controllers\public\PromotionDetailController.cs:line 40
```
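For reference, here is a minimal sketch (not our production code) of how the keep-alive setting mentioned above can be confirmed from C#, assuming it is configured through the standard Tcpip registry value; KeepAliveTime is stored in milliseconds:

```csharp
using System;
using Microsoft.Win32;

class KeepAliveCheck
{
    static void Main()
    {
        // Hypothetical troubleshooting check: KeepAliveTime lives under the Tcpip service
        // parameters and is expressed in milliseconds. If the value is absent, Windows
        // falls back to its default of 7,200,000 ms (2 hours).
        const string keyPath = @"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters";
        object value = Registry.GetValue(keyPath, "KeepAliveTime", null);

        if (value == null)
        {
            Console.WriteLine("KeepAliveTime not set; Windows default (7200000 ms) applies.");
        }
        else
        {
            Console.WriteLine("KeepAliveTime = {0} ms", value);
        }
    }
}
```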
I have been having the same issue over the last week with a Windows service that was recently upgraded to .NET client 2.1.3. Restarting the service fixes the problem when it occurs.
It appears to be specific to the running application; none of the other applications have the problem at the same time. In fact, the exact same service running on other servers in the farm, against the exact same cluster, works fine. This makes me believe it's an issue in the .NET client.
This is running against a Couchbase Server 3.1 cluster on Amazon Linux. The client service runs on Windows Server 2012 with .NET 4.5.2. No rebalance is being performed on the cluster when the errors occur.
I have also reviewed the commits for client 2.1.4 and don't see any changes since 2.1.3 that look like they would prevent this problem. However, it looks like the commit for NCBC-968 ("NRE when master node cannot be obtained during Observe", couchbase/couchbase-net-client@30e46f0 on GitHub) might change the error behavior so that it returns Success = false with an Exception object rather than throwing an exception. Not sure if I'm reading that right, though.
If a TCP connection is reset, the client puts that node (the client-side representation of the node, not the actual server node) into a temporary "down" state while it tries to reconnect. While this is happening, any operation whose key maps to that node returns a result with a NodeUnavailableException in its Exception field; the exception is always handled before the client returns control to the application, but it will be logged.
The purpose is that if a node in the cluster does go completely down, the client won't waste cycles sending requests that will time out; it fails fast with that error while it tries to reconnect. If it can reconnect, the node goes back "online" and starts processing requests again (again, this is the client object representing a remote node in the cluster).
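In application code that means checking the returned result rather than expecting a thrown exception. Here is a minimal sketch against the 2.x API; the bucket name, key, and value are illustrative, not taken from the original post:

```csharp
using System;
using Couchbase;
using Couchbase.Core;

class FailFastExample
{
    static void Main()
    {
        // Placeholder cluster/bucket configuration, not the poster's setup.
        using (var cluster = new Cluster())
        using (IBucket bucket = cluster.OpenBucket("default"))
        {
            var result = bucket.Upsert("PromotionData_PromotionNumber:example", new { Active = true });

            if (!result.Success)
            {
                // While the node is marked "down", the operation fails fast and the
                // NodeUnavailableException is surfaced on the result instead of being thrown.
                Console.WriteLine("Cache write failed: {0}", result.Message);

                if (result.Exception is NodeUnavailableException)
                {
                    // Treat as transient: skip caching, retry later, or fall back to the source of truth.
                }
            }
        }
    }
}
```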
Unfortunately, the criteria for putting a node into a down state were simply too generic in 2.1.3; basically any IO error could put the client into this state. This was improved significantly in 2.2.0, so I suggest you update to the latest release.
We updated to 2.2.1 and the issue persists. My team will continue to look into the issue and respond if we find anything. Has anyone else experienced this error with 2.2.1?
If it's the exact same stack trace and exception, I would look into why the remote host (or perhaps a load balancer or other network appliance) is closing the connections. Also, is it the same node every time, or random nodes?
Shortly after I reported the error we worked around it by setting a low timeout for the Couchbase response. If the response times out after a few hundred milliseconds we treat the failure as if the key did not exist in Couchbase and move on. Since then we have upgraded from v3.0.3 to v3.1.2, and honestly the error has not been at the top of our priority list because the workaround suits our needs.
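Roughly, the workaround looks like the sketch below: race the cache read against a short timeout and treat a timeout (or any cache error) as a miss. The wrapper name and the 300 ms budget are illustrative, not our exact code:

```csharp
using System;
using System.Threading.Tasks;

class CacheWithTimeout
{
    // Hypothetical synchronous cache read; in practice this wraps the Couchbase Get call.
    static string ReadFromCache(string key)
    {
        throw new NotImplementedException();
    }

    // Returns the cached value, or null (a "miss") if the cache does not answer in time.
    static async Task<string> GetOrMissAsync(string key, TimeSpan budget)
    {
        var cacheTask = Task.Run(() => ReadFromCache(key));
        var finished = await Task.WhenAny(cacheTask, Task.Delay(budget));

        if (finished != cacheTask)
        {
            return null; // Timed out: behave as if the key is not in Couchbase and move on.
        }

        try
        {
            return await cacheTask;
        }
        catch (Exception)
        {
            return null; // Any cache error is also treated as a miss.
        }
    }

    static void Main()
    {
        var value = GetOrMissAsync("PromotionData_PromotionNumber:example", TimeSpan.FromMilliseconds(300))
            .GetAwaiter().GetResult();
        Console.WriteLine(value ?? "cache miss");
    }
}
```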
That being said, another team in our organization observed an application trying to retrieve a bad cache key (length 268), which resulted in connectivity errors being logged for the next hour. That doesn't explain all of our errors, but it might be noteworthy.
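For what it's worth, Couchbase document keys are limited to 250 bytes, so a key of length 268 is invalid on its own. A small guard like the following (purely illustrative, not the code either team runs) can normalize over-length keys before they reach the client:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class CacheKeys
{
    private const int MaxKeyBytes = 250; // Couchbase/memcached key length limit.

    // Returns the key unchanged when it fits, otherwise replaces it with a
    // deterministic hash so it stays under the limit and remains stable per input.
    public static string Normalize(string key)
    {
        if (string.IsNullOrEmpty(key))
            throw new ArgumentException("Cache key must be non-empty.", "key");

        if (Encoding.UTF8.GetByteCount(key) <= MaxKeyBytes)
            return key;

        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(key));
            return "hashed:" + BitConverter.ToString(hash).Replace("-", string.Empty);
        }
    }
}
```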