Unrecoverable error in socket timeouts to KeyValueEndpoint on 11210

Hello all,

We’re currently experiencing intermittent errors in our Java application, which uses spring-data-couchbase (2.0.0.RELEASE). We’re running Couchbase Community Edition 4.5.0-2601 on a four-node cluster.

Every 3-7 days, the following error shows up in our Java logs. When it does, it is basically the ‘kiss of death’: most database connectivity for key-value operations grinds to a halt in the Java component that throws the error:

2017-07-07 08:13:09.107 WARN [cb-io-1-23] c.c.client.core.endpoint.Endpoint : [xxxxxxx/10.224.165.186:11210][KeyValueEndpoint]: Socket connect took longer than specified timeout.

After this comes a litany of ConcurrentTimeoutExceptions for all key-value operations. Bouncing the Java application resolves the issue.

As some additional info, we have started keeping track of open TCP connections on port 11210. We almost always have 16 connections open (which makes sense: 4 kvServiceEndpoints * 4 Couchbase nodes = 16), but once the log entry above shows up, we end up with a permanent maximum of 15 connections, and the missing connection never recovers.
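The connection check described above can be sketched in plain Java. This is only an illustration, assuming Linux `netstat -tn` style output (six whitespace-separated columns with the remote address in the fifth); the class and method names are hypothetical and not part of the Couchbase SDK.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the Couchbase SDK.
final class KvConnectionCheck {

    // Expected number of KV sockets: one per endpoint per node.
    static int expectedConnections(int kvServiceEndpoints, int nodeCount) {
        return kvServiceEndpoints * nodeCount;
    }

    // Counts ESTABLISHED connections whose remote address ends with ":<port>".
    // Assumed column layout (netstat -tn):
    // Proto Recv-Q Send-Q Local-Address Foreign-Address State
    static long countEstablished(List<String> netstatLines, int port) {
        String suffix = ":" + port;
        return netstatLines.stream()
                .map(line -> line.trim().split("\\s+"))
                .filter(cols -> cols.length >= 6)
                .filter(cols -> cols[4].endsWith(suffix))
                .filter(cols -> "ESTABLISHED".equals(cols[5]))
                .count();
    }

    public static void main(String[] args) {
        // Reads netstat output from stdin.
        List<String> lines = new BufferedReader(new InputStreamReader(System.in))
                .lines()
                .collect(Collectors.toList());
        long open = countEstablished(lines, 11210);
        int expected = expectedConnections(4, 4); // values from our setup
        System.out.println(open + " of " + expected + " expected KV connections established");
    }
}
```

It could be run as `netstat -tn | java KvConnectionCheck` from a cron job, so that a drop below the expected count is noticed as soon as it happens.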

Our config is as follows:

2017-07-07 17:51:16.314 INFO [main] com.couchbase.client.core.CouchbaseCore : CouchbaseEnvironment: {sslEnabled=false, sslKeystoreFile='null', sslKeystorePassword='null', queryEnabled=false, queryPort=8093, bootstrapHttpEnabled=true, bootstrapCarrierEnabled=true, bootstrapHttpDirectPort=8091, bootstrapHttpSslPort=18091, bootstrapCarrierDirectPort=11210, bootstrapCarrierSslPort=11207, ioPoolSize=40, computationPoolSize=40, responseBufferSize=16384, requestBufferSize=16384, kvServiceEndpoints=4, viewServiceEndpoints=1, queryServiceEndpoints=4, ioPool=NioEventLoopGroup, coreScheduler=CoreScheduler, eventBus=DefaultEventBus, packageNameAndVersion=couchbase-jvm-core/1.2.3 (git: 1.2.3), dcpEnabled=false, retryStrategy=BestEffort, maxRequestLifetime=300000, retryDelay=ExponentialDelay{growBy 1.0 MICROSECONDS; lower=100, upper=100000}, reconnectDelay=ExponentialDelay{growBy 1.0 MILLISECONDS; lower=32, upper=4096}, observeIntervalDelay=ExponentialDelay{growBy 1.0 MICROSECONDS; lower=10, upper=100000}, keepAliveInterval=30000, autoreleaseAfter=2000, bufferPoolingEnabled=true, tcpNodelayEnabled=true, mutationTokensEnabled=false, socketConnectTimeout=1000, dcpConnectionBufferSize=20971520, dcpConnectionBufferAckThreshold=0.2, queryTimeout=300000, viewTimeout=75000, kvTimeout=2500, connectTimeout=5000, disconnectTimeout=25000, dnsSrvEnabled=false}
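Since the warning presumably fires when a connect exceeds socketConnectTimeout (1000 ms in the config above), a standalone probe might help tell whether the network path to a node is slow independently of the SDK. The following is a minimal sketch using only the JDK; `ConnectProbe` and `connectMillis` are hypothetical names, and a plain TCP connect does not exercise the SDK's own handshake or endpoint logic.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic helper, not part of the Couchbase SDK.
final class ConnectProbe {

    // Returns the TCP connect time in milliseconds,
    // or -1 if the connect failed or exceeded timeoutMillis.
    static long connectMillis(String host, int port, int timeoutMillis) {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return (System.nanoTime() - start) / 1_000_000;
        } catch (IOException e) {
            // Covers SocketTimeoutException, connection refused, unknown host.
            return -1;
        }
    }

    public static void main(String[] args) {
        // Host is taken from the command line; 11210 and 1000 ms mirror the config above.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        long ms = connectMillis(host, 11210, 1000);
        System.out.println(ms < 0
                ? "connect failed or timed out"
                : "connected in " + ms + " ms");
    }
}
```

Running this periodically against each of the four nodes and logging the result would show whether connect times creep toward the 1000 ms limit before the warning appears.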

Has anyone else faced similar issues?

@bsiggers would you be open to trying a newer version of spring-data-couchbase (or, if that’s not possible, at least manually bumping the SDK version to something newer, like 2.4.6 or so) and seeing if the issue persists?

If it does, it might make sense to grab debug/trace logs from around that time and take a closer look to see what’s causing the disruption.

Thanks @daschl - we’ll give updating spring-data-couchbase to 2.4.6 a try. It’s something we were also considering, but because some of the APIs have changed, we were putting it off due to the refactoring involved.