I was able to reproduce the problem of couchbase SDK not reestablishing a connection. Here is how.
I created a couchbase client which connects to a bucket and try a read operation every 10 secs.
Configuration:
2 Node Couchbase cluster 2.2
1.4.9 Couchbase SDK
Test #1:
Fair close of a connection to the couchbase without killing the process.
Run the java client which gets a document every 10 sec
Find pid owned by process – sudo netstat -tanp | grep 11210
Find file descriptor for the port – lsof -np <pid> | grep <local_port>
Find the fd which is nothing but the 3rd column with u removed
Connect to process via gdb – gdp -p <pid>
gdb>call shutdown(<fd>)
gdb>cont
Everything worked fine and the sdk reconnected and continued the 10 sec loop of getting the document.
Logs>
2015-06-16 05:31:18.383 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 10 Opaque: 21 2015-06-16 05:31:18.383 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 0 Opaque: 22 Key: 00048938-46a6-4c3a-8b65-3f1034a171c0 2015-06-16 05:31:18.384 INFO com.couchbase.client.CouchbaseConnection: Reconnecting due to exception on {QA sa=cqzsdb02.xyz.com/10.84.80.107:11210, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0, interested=1} java.io.IOException: Disconnected unexpected, will reconnect. at net.spy.memcached.MemcachedConnection.handleReadsWhenChannelEndOfStream(MemcachedConnection.java:911) at net.spy.memcached.MemcachedConnection.handleReads(MemcachedConnection.java:823) at net.spy.memcached.MemcachedConnection.handleReadsAndWrites(MemcachedConnection.java:720) at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:683) at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:436) at com.couchbase.client.CouchbaseConnection.run(CouchbaseConnection.java:325) 2015-06-16 05:31:18.385 WARN com.couchbase.client.CouchbaseConnection: Closing, and reopening {QA sa=cqzsdb02.xyz.com/10.84.80.107:11210, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0, interested=1}, attempt 0. 05:31:18,386 ERROR Thread-10 javatut.CouchbaseConnIncreaseTest:105 - Error occurred is java.util.concurrent.CancellationException: Cancelled 2015-06-16 05:31:20.389 INFO com.couchbase.client.CouchbaseConnection: Reconnecting {QA sa=cqzsdb02.xyz.com/10.84.80.107:11210, #Rops=0, #Wops=0, #iq=0, topRop=null, topWop=null, toWrite=0, interested=0} 2015-06-16 05:31:20.495 INFO net.spy.memcached.auth.AuthThread: Authenticated to cqzsdb02.xyz.com/10.84.80.107:11210
Test #2:
Unfair close of a connection to the couchbase without killing the process.
Run the java client which gets a document every 10 sec
Find pid owned by process – sudo netstat -tanp | grep 11210
Find file descriptor for the port – lsof -np <pid> | grep <local_port>
Find the fd which is nothing but the 3rd column with u removed
Connect to process via gdb – gdp -p <pid>
gdb>call shutdown(<fd>)
gdb>cont
SDK went into an infinite loop of retries and got the document after 3hr 28min!
Logs>
05:44:33,204 ERROR Thread-10 javatut.CouchbaseConnIncreaseTest:105 - Error occurred is net.spy.memcached.OperationTimeoutException: Timeout waiting for value: waited 2,500 ms. Node status: Connection Status { cqzsdb02.xyz.com/10.84.80.107:11210 active: true, authed: true, last read: 13,906 ms ago cqzsdb01.xyz.com/10.84.80.106:11210 active: true, authed: true, last read: 717 ms ago }
05:44:45,719 ERROR Thread-10 javatut.CouchbaseConnIncreaseTest:105 - Error occurred is net.spy.memcached.OperationTimeoutException: Timeout waiting for value: waited 2,500 ms. Node status: Connection Status { cqzsdb02.xyz.com/10.84.80.107:11210 active: true, authed: true, last read: 26,422 ms ago cqzsdb01.xyz.com/10.84.80.106:11210 active: true, authed: true, last read: 4,221 ms ago }
05:44:58,220 ERROR Thread-10 javatut.CouchbaseConnIncreaseTest:105 - Error occurred is net.spy.memcached.OperationTimeoutException: Timeout waiting for value: waited 2,500 ms. Node status: Connection Status { cqzsdb02.xyz.com/10.84.80.107:11210 active: true, authed: true, last read: 38,923 ms ago cqzsdb01.xyz.com/10.84.80.106:11210 active: true, authed: true, last read: 4,988 ms ago }
2015-06-17 02:13:18.442 WARN com.couchbase.client.CouchbaseConnection: sun.nio.ch.SelectionKeyImpl@6fc757 exceeded continuous timeout threshold
2015-06-17 02:13:18.442 WARN com.couchbase.client.CouchbaseConnection: Closing, and reopening {QA sa=cqzsdb02.xyz.com/10.84.80.107:11210, #Rops=0, #Wops=2996, #iq=0, topRop=null, topWop=Cmd: 10 Opaque: 51, toWrite=0, interested=4}, attempt 0.
2015-06-17 02:13:18.443 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 10 Opaque: 51
2015-06-17 02:13:18.443 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 0 Opaque: 52 Key: 00048938-46a6-4c3a-8b65-3f1034a171c0
2015-06-17 02:13:18.444 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 10 Opaque: 54
2015-06-17 02:13:18.444 WARN net.spy.memcached.protocol.binary.BinaryMemcachedNodeImpl: Discarding partially completed op: Cmd: 10 Opaque: 56
2015-06-17 02:13:20.445 INFO com.couchbase.client.CouchbaseConnection: Reconnecting {QA sa=cqzsdb02.xyz.com/10.84.80.107:11210, #Rops=0, #Wops=0, #iq=1, topRop=null, topWop=null, toWrite=0, interested=0}
This goes infinitely and I get the following exception after along time - exactly 3hour 28minutes . This is a very long time . After this SDK is able to reestablish the connection but how do you reduce this time to a lesser value like 2 minutes.I have not used any tuning setting from SDK documentation.
Problems:
Why is couchbase client sdk taking so long to connect back.