I have a fairly small solution running on two cluster nodes. If one node fails (or is taken down for service) then the entire application fails!
Should it not be possible to continue running? There is one node up and running fine - and it can just rebalance once the other node is back up and running. Should I do anything in particular in my Java code for this to work? Right now I tell the SDK that there are two nodes running in the cluster.
The same is also relevant for mobile users syncing via Sync Gateway.
Thanks in advance for any insights
I’m using Couchbase Community Edition for this solution (v. 6.0)
Well, any access to the database is failing - with something like this:
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException
at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:77)
at com.couchbase.client.java.CouchbaseBucket.get(CouchbaseBucket.java:131)
at com.couchbase.client.java.CouchbaseBucket.get(CouchbaseBucket.java:126)
at dk.dtu.aqua.catchlog.dao.CouchbaseUserDAO.loadUser(CouchbaseUserDAO.java:497)
Do you have automatic failover turned on? If so, when the 2nd node is taken down, the replicas on the 1st node should be promoted and the app should continue working. There may be a short period while this happens, and as with any data access, you should prepare for and handle exceptions. If automatic failover is NOT turned on, that could explain why you are seeing errors like this.
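For instance, a small retry around the key/value get is usually enough to ride out that promotion window. Here's a rough sketch - the class name, method name and retry count are just illustrative, not something from your code:

import java.util.concurrent.TimeoutException;

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.document.JsonDocument;

public class ResilientGet {

    // Retries a blocking KV get a few times, backing off briefly, to ride out
    // the short window while a failover promotes replicas on the surviving node.
    public static JsonDocument getWithRetry(Bucket bucket, String id, int attempts) {
        RuntimeException lastError = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return bucket.get(id);
            } catch (RuntimeException e) {
                if (!(e.getCause() instanceof TimeoutException)) {
                    throw e; // not a timeout - rethrow immediately
                }
                lastError = e; // remember the timeout and try again
                try {
                    Thread.sleep(500); // short pause before the next attempt
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
        throw lastError;
    }
}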
I’m going to move your question into the Java forum, just in case they have more insight for you there.
Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException
at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:77)
at com.couchbase.client.java.CouchbaseBucket.get(CouchbaseBucket.java:131)
at com.couchbase.client.java.CouchbaseBucket.get(CouchbaseBucket.java:126)
at dk.dtu.aqua.catchlog.dao.CouchbaseUserDAO.loadUser(CouchbaseUserDAO.java:497)
… in different situations that don’t have anything to do with a node being down… The problem always seems to show up on a line like this one:
JsonDocument doc = getDb().get(getDocId(key));
And my getDb() method is defined like this:
private Bucket db = null;
private CouchbaseEnvironment dbEnv = null;

// Couchbase DB methods
public Bucket getDb() {
    if (null == db) {
        Util.trace("Get handle to Couchbase DB");
        // this tunes the SDK (to customize connection timeout)
        CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
                .kvTimeout(15000)      // 15000ms = 15s
                .connectTimeout(60000) // 60000ms = 60s, default is 5s
                .queryTimeout(120000)  // 120000ms = 120s
                .build();
        Cluster cluster = CouchbaseCluster.create(env, ConfigurationBean.get().getDatabaseServerNames());
        // comment next line out to run with defaults....
        dbEnv = DefaultCouchbaseEnvironment.builder().bufferPoolingEnabled(false).build();
        if (dbEnv == null) {
            Util.info("Running with default database settings...");
            cluster = CouchbaseCluster.create(ConfigurationBean.get().getDatabaseServerNames());
        } else {
            Util.info("Running with special database tuning...");
            cluster = CouchbaseCluster.create(dbEnv, ConfigurationBean.get().getDatabaseServerNames());
        }
        Util.info("Got handle to cluster: " + String.join(", ", ConfigurationBean.get().getDatabaseServerNames()));
        cluster.authenticate(CB_DATA_USER, CB_DATA_PASSWORD);
        Util.trace("Authenticated");
        db = cluster.openBucket(CB_DATA_STORE);
        Util.trace("Got handle to bucket");
    }
    return db;
}
The issue does not seem to arise when first opening and setting the db (the bucket) object - but apparently some time later when reusing the bucket object.
Is there a better way to open the bucket and cache it for later use?
My environment: JVM 1.8, Couchbase Server 6.0.0 build 1693, Java SDK 2.5.7 (due to the snappy error).
As long as you have a singleton holding the instance of your Bucket/CouchbaseEnvironment you are fine, so no big issues in the code above.
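Just to illustrate what I mean by a singleton holder, something like this is enough (class and method names are only an example, not taken from your code):

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

// One environment, one cluster and one bucket for the whole application,
// created lazily on first use and then reused by every caller.
public final class CouchbaseHolder {

    private static volatile Bucket bucket;

    private CouchbaseHolder() { }

    public static Bucket bucket(String user, String password, String bucketName, String... nodes) {
        if (bucket == null) {
            synchronized (CouchbaseHolder.class) {
                if (bucket == null) {
                    CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder().build();
                    Cluster cluster = CouchbaseCluster.create(env, nodes);
                    cluster.authenticate(user, password);
                    bucket = cluster.openBucket(bucketName);
                }
            }
        }
        return bucket;
    }
}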
Regarding the main problem, timeouts are always the consequence of some other issue. It could be an unstable network, a cluster not properly sized for your volume of data, etc. As a quick fix, you could increase the kvTimeout. The get operation uses the Couchbase key-value store internally (it does NOT use the query service, which makes the queryTimeout ineffective in this case, and I would suppose the same is true for connectTimeout).
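For example, in your getDb() you could build the environment with only that override and leave everything else at the defaults (the 15s value is just something to experiment with, not a recommendation):

// Raise only the KV timeout; all other settings stay at the SDK defaults.
CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
        .kvTimeout(15000) // 15s instead of the default 2500ms
        .build();
Cluster cluster = CouchbaseCluster.create(env, ConfigurationBean.get().getDatabaseServerNames());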
Well, it’s an application scope instance - so yes, there is only one at any time.
I have played around with the environment settings today - and, honestly, the defaults seem to behave more stably than my own tuning (!!)
The network is internal on a VMware ESXi server - using IP addresses via a hosts file. In my demo environment it may suffer a little from low resources - however, there are only one or two users on it at any time, so I would expect it to cope. But I can try to increase the kvTimeout parameter, leave the others at their defaults, and see if that improves the situation.
Is there some timing (e.g. the connectTimeout?) that I should take into account in relation to failover?
Apart from the kvTimeout, I think you should be fine with the defaults. Make sure the number of replicas is set to 1 (as you only have 2 servers), so that if one node fails you still have a copy of the data on the other node.
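If you want to double-check the replica count from code rather than the web console, something along these lines should work with the 2.x SDK (the admin credentials below are placeholders):

// Read the bucket settings back from the cluster manager and log the replica count.
ClusterManager manager = cluster.clusterManager("Administrator", "password");
BucketSettings settings = manager.getBucket(CB_DATA_STORE);
Util.info("Replicas configured: " + settings.replicas());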
I increased the kvTimeout and I have a long-running task that reads and updates a few tens of thousands of docs - and so far it has not timed out.
It seems relatively slow, though (CPU is high)… Could that be due to Sync Gateway wanting to import all the changes as well? If so, should I take the database offline in Sync Gateway? Or could it be re-indexing? I just have two cluster nodes and they both run these services: data, index, query and search.
Oh, btw, where should I set replication to 1? I just tried to find that - it’s not in the SyncGateway, I guess… So I may have overlooked something?
Good to hear that at least you are not getting timeouts.
Regarding being slow, that might be due to a variety of reasons: missing indexes, low memory residence, not enough memory left for the OS, etc. We would need a little more information to guess what is happening.
And I just had to click the bucket to expand it, and it is OK:
Replicas: 1, Server Nodes: 2
I appreciate it’s difficult to troubleshoot the slowness - and I'm not sure it's worth it, as that job is just re-organizing some data and I'm using the application logic (and DAOs) to do the re-org.
I was just curious as to where I could find/see any reasons for this. Still learning about Couchbase
Do you know if there are any other settings I should consider for the failover to work? Or should I change my code to actively discover and handle a failover? Any insights much appreciated
I suppose I could just try to take one of the servers down and see if it works now