Resource leaks when connection lost due to "exceeded continuous timeout threshold"

We have two primary applications that interact with Couchbase Server 3.0.1 Community Edition. One handles bulk upload into a cluster; this past summer we updated it to use the 2.x Java client SDK for performance reasons. The other is a Jetty-based web server that uses the 1.4.7 Java client SDK to communicate with single instances of a Couchbase Server, and that is the application I am concerned about right now. (Note that I see similar behavior with the latest 1.4.10 SDK as well.) Updating that application to the 2.x SDK is not feasible right now. I have reproduced the behavior in a dedicated test application, though I have not yet removed all of our proprietary code from it. In production we connect to three buckets in our Couchbase database, but in the test environment I can reproduce the problem with a single bucket.

In a nutshell, the test case creates a single CouchbaseClient instance connected to one bucket, plus a number of threads that all perform simple key/value GET operations against the same key, over and over. It is easy to establish a load that results in request timeouts and ultimately in the "connection" (an overloaded term, I know) being dropped for exceeding the continuous timeout threshold. (Example log message below.) When this happens, code within the Couchbase client (and/or the underlying memcached client) triggers a reconnect. We install a connection observer and log the connection-lost and connection-restored events. When the test is over, we call the shutdown method on the client, but if the connection was lost and restored during the run, shutdown fails because there are still MemcachedConnection threads (MemcachedConnection extends SpyThread) running that are neither tracked nor cleaned up. I have seen these orphaned threads both in heap dumps taken with jmap and in the Eclipse debugger while running the test program.
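For reference, here is a minimal sketch of the kind of test program I am describing, written against the 1.4.x client API. The node address, bucket name, key, and thread/iteration counts are placeholders for illustration, not our actual configuration.

```java
import com.couchbase.client.CouchbaseClient;
import net.spy.memcached.ConnectionObserver;

import java.net.SocketAddress;
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class ReconnectLeakTest {

  public static void main(String[] args) throws Exception {
    // Placeholder node address and bucket; not our real environment.
    List<URI> nodes = Arrays.asList(URI.create("http://cs-host.example.net:8091/pools"));
    final CouchbaseClient client = new CouchbaseClient(nodes, "test-bucket", "");

    // Observer that logs connection-lost / connection-restored events.
    client.addObserver(new ConnectionObserver() {
      public void connectionEstablished(SocketAddress sa, int reconnectCount) {
        System.out.println("Connection restored: " + sa + " (reconnect #" + reconnectCount + ")");
      }

      public void connectionLost(SocketAddress sa) {
        System.out.println("Connection lost: " + sa);
      }
    });

    // Many threads hammering the same key to provoke continuous timeouts.
    final String key = "test-key";
    int threadCount = 100;
    Thread[] workers = new Thread[threadCount];
    for (int i = 0; i < threadCount; i++) {
      workers[i] = new Thread(new Runnable() {
        public void run() {
          for (int j = 0; j < 10000; j++) {
            try {
              client.get(key);
            } catch (Exception e) {
              // Timeouts are expected under this load; ignore and continue.
            }
          }
        }
      });
      workers[i].start();
    }
    for (Thread t : workers) {
      t.join();
    }

    // If the connection was lost and re-established during the run, this
    // does not complete cleanly: orphaned MemcachedConnection threads
    // (SpyThread subclasses) are still running.
    boolean clean = client.shutdown(10, TimeUnit.SECONDS);
    System.out.println("Clean shutdown: " + clean);
  }
}
```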

I can also see the connection count increase in the Couchbase Server UI when this happens, and I can confirm via netstat that additional connections to port 11210 remain open for each time the high-level connection is lost and restored.

One of the messages we log is as follows.
[Memcached IO over {MemcachedConnection to csdbedge1.xxx.net/216.38.170.75:11210}] client.CouchbaseConnection - sun.nio.ch.SelectionKeyImpl@5e41108b exceeded continuous timeout threshold

That bracketed [Memcached IO over …] string is the name of the thread, as set in the MemcachedConnection constructor.

The behavior differs slightly between 1.4.7 and 1.4.10, but in both cases extra MemcachedConnection instances are created and never cleaned up. In 1.4.7 the net effect seemed to be an ever-increasing number of connections; in 1.4.10 I eventually reach a point where no new connections are established and every subsequent request fails with a timeout.
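For anyone trying to reproduce this, a simple programmatic way to spot the leftover threads (besides jmap and the debugger) is to enumerate live threads and match on the "Memcached IO over" name prefix that appears in the log line above. Something along these lines:

```java
public class MemcachedThreadCheck {

  // Counts live threads whose names start with "Memcached IO over",
  // the prefix visible in the log message quoted earlier. After a clean
  // shutdown this should be zero; after a lost-and-restored connection
  // it is not.
  public static int countMemcachedIoThreads() {
    int count = 0;
    for (Thread t : Thread.getAllStackTraces().keySet()) {
      if (t.isAlive() && t.getName().startsWith("Memcached IO over")) {
        count++;
      }
    }
    return count;
  }
}
```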

During initial testing in the context of the Jetty application, I was able to avoid the problem by starting with a light load and increasing it slowly. However, if I slam the server with a high load right away (say, 500 requests per second spread over 500 HTTP client threads), I see the problem quickly. In my test program I have not tried any warm-up phase.

Our timeoutExceptionThreshold is set to 15 rather than the default of 998, and we probably should raise it, but that does not change the fact that the connection-recovery logic appears to leak TCP connections and threads.
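For completeness, this is roughly how that threshold gets set on our side, assuming the standard 1.4.x CouchbaseConnectionFactoryBuilder API; the node address and bucket name below are placeholders.

```java
import com.couchbase.client.CouchbaseClient;
import com.couchbase.client.CouchbaseConnectionFactory;
import com.couchbase.client.CouchbaseConnectionFactoryBuilder;

import java.net.URI;
import java.util.Arrays;
import java.util.List;

public class ClientFactory {

  public static CouchbaseClient create() throws Exception {
    // Placeholder node address and bucket name.
    List<URI> nodes = Arrays.asList(URI.create("http://cs-host.example.net:8091/pools"));

    CouchbaseConnectionFactoryBuilder builder = new CouchbaseConnectionFactoryBuilder();
    // We currently run with 15; the SDK default is 998.
    builder.setTimeoutExceptionThreshold(15);

    CouchbaseConnectionFactory cf =
        builder.buildCouchbaseConnection(nodes, "test-bucket", "");
    return new CouchbaseClient(cf);
  }
}
```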

I will keep working to reproduce the problem in as simple a context as I can, and then share the code, but in the meantime I would like to know whether anyone has run into a similar problem. I searched some other timeout topics and did not find anything relevant.