Simple solution with two nodes - fails entirely if one node is not available

I have a fairly small solution running on two cluster nodes. If one node fails (or is taken down for service) then the entire application fails!

Should it not be possible to continue running? There is one node up and running fine - and it can just rebalance once the other node is back up and running. Should I do anything in particular in my Java code for this to work? Right now I tell the SDK that there are two nodes running in the cluster.

The same question is also relevant for mobile users via Sync Gateway.

Thanks in advance for any insights :slight_smile:

I’m using Couchbase Community Edition for this solution (v. 6.0)

What actually happens when “the entire application fails”? Are you seeing error messages, exceptions, etc?

Well, any access to the database is failing - with something like this:

Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException
	at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:77)
	at com.couchbase.client.java.CouchbaseBucket.get(CouchbaseBucket.java:131)
	at com.couchbase.client.java.CouchbaseBucket.get(CouchbaseBucket.java:126)
	at dk.dtu.aqua.catchlog.dao.CouchbaseUserDAO.loadUser(CouchbaseUserDAO.java:497)

Do you have automatic failover turned on? If so, when the 2nd node is taken down, the replicas on the 1st node should be promoted and the app should continue working. There may be a short period while this happens, and as with any data access, you should prepare for and handle exceptions. If automatic failover is NOT turned on, that could explain why you are seeing errors like this.
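
For the "handle exceptions" part, a minimal retry sketch against the 2.x blocking API could look something like this (the helper name, retry count and delay are just placeholders - adapt it to your own DAO):

	// Needs: com.couchbase.client.java.Bucket, com.couchbase.client.java.document.JsonDocument,
	// java.util.concurrent.TimeoutException
	// Hypothetical helper: retry a KV get a few times so that a short failover
	// window does not bubble up as a hard failure.
	private JsonDocument getWithRetry(Bucket bucket, String id) {
		final int maxAttempts = 3;
		for (int attempt = 1; attempt <= maxAttempts; attempt++) {
			try {
				return bucket.get(id);
			} catch (RuntimeException e) {
				// The blocking API wraps java.util.concurrent.TimeoutException in a
				// RuntimeException (as in the stack trace above).
				boolean timedOut = e.getCause() instanceof TimeoutException;
				if (!timedOut || attempt == maxAttempts) {
					throw e; // not a timeout, or out of retries
				}
				try {
					Thread.sleep(1000L * attempt); // simple backoff while failover completes
				} catch (InterruptedException ie) {
					Thread.currentThread().interrupt();
					throw e;
				}
			}
		}
		return null; // unreachable
	}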

I’m going to move your question into the Java forum, just in case they have more insight for you there.

A very good question!

I thought I had automatic failover turned on. But can you quickly point me in the direction of the necessary steps? Then I’ll verify my setup :+1:

It is under General Settings on the UI (see docs https://docs.couchbase.com/server/6.5/manage/manage-settings/general-settings.html) under “Node Availability”. I’ve put a red box around it in this screenshot:

You can also control it with REST/cli.
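
If you want to script it instead, the setting is exposed via the /settings/autoFailover REST endpoint on port 8091. A rough Java sketch (host name, credentials and the 120-second timeout are placeholders - you could of course just use curl instead):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class EnableAutoFailover {
	public static void main(String[] args) throws Exception {
		// Placeholders - use your own node address and admin credentials.
		URL url = new URL("http://cb-node1:8091/settings/autoFailover");
		String auth = Base64.getEncoder()
				.encodeToString("Administrator:password".getBytes(StandardCharsets.UTF_8));

		HttpURLConnection conn = (HttpURLConnection) url.openConnection();
		conn.setRequestMethod("POST");
		conn.setRequestProperty("Authorization", "Basic " + auth);
		conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
		conn.setDoOutput(true);

		// Enable auto-failover with a 120-second timeout before a down node is failed over.
		byte[] body = "enabled=true&timeout=120".getBytes(StandardCharsets.UTF_8);
		try (OutputStream os = conn.getOutputStream()) {
			os.write(body);
		}
		System.out.println("HTTP " + conn.getResponseCode());
		conn.disconnect();
	}
}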

Ok. I’m on version 6.0 (CE) so my node availability looks like this:
[screenshot of the auto-failover setting in the 6.0 UI]
But I guess that should do the same?
Perhaps I should adjust the timeout?

It is also worth mentioning that I’m using the Java SDK 2.5.7 due to this snappy error.

I see some other issues with timeouts, so I’ll try to update to the latest maintenance release and Java SDK and see if that solves some of the issues.

Actually, I’m seeing this issue

Caused by: java.lang.RuntimeException: java.util.concurrent.TimeoutException
	at com.couchbase.client.java.util.Blocking.blockForSingle(Blocking.java:77)
	at com.couchbase.client.java.CouchbaseBucket.get(CouchbaseBucket.java:131)
	at com.couchbase.client.java.CouchbaseBucket.get(CouchbaseBucket.java:126)
	at dk.dtu.aqua.catchlog.dao.CouchbaseUserDAO.loadUser(CouchbaseUserDAO.java:497)

… in different situations - situations that have nothing to do with a node being down… The problem seems to occur in the same kind of situation - at a line like this:

JsonDocument doc = getDb().get(getDocId(key));

And my getDb() method is defined like this:

	private Bucket db = null;
	private CouchbaseEnvironment dbEnv = null;

	// Couchbase DB methods
	public Bucket getDb() {
		if (null == db) {
			Util.trace("Get handle to Couchbase DB");
			// This tunes the SDK (to customize the timeouts and buffer pooling).
			// Comment the next statement out to run with the defaults....
			dbEnv = DefaultCouchbaseEnvironment.builder()
					.kvTimeout(15000)      // 15000ms = 15s
					.connectTimeout(60000) // 60000ms = 60s, default is 5s
					.queryTimeout(120000)  // 120000ms = 120s
					.bufferPoolingEnabled(false)
					.build();
			Cluster cluster;
			if (dbEnv == null) {
				Util.info("Running with default database settings...");
				cluster = CouchbaseCluster.create(ConfigurationBean.get().getDatabaseServerNames());
			} else {
				Util.info("Running with special database tuning...");
				cluster = CouchbaseCluster.create(dbEnv, ConfigurationBean.get().getDatabaseServerNames());
			}
			Util.info("Got handle to cluster: " + String.join(", ", ConfigurationBean.get().getDatabaseServerNames()));
			cluster.authenticate(CB_DATA_USER, CB_DATA_PASSWORD);
			Util.trace("Authenticated");
			db = cluster.openBucket(CB_DATA_STORE);
			Util.trace("Got handle to bucket");
		}
		return db;
	}

The issue does not seem to arise when the db (bucket) object is first opened and set - but apparently some time later, when reusing the bucket object.

Is there a better way to open the bucket and cache it for next usage?

My environment:
JVM: 1.8
Couchbase Server 6.0.0 build 1693
Java SDK 2.5.7 (due to the snappy error)

As long as you have a singleton holding the instance of your Bucket/CouchbaseEnvironment you are fine, so there are no big issues in the code above.
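
For reference, a minimal sketch of that pattern (class name, node addresses, bucket name and credentials are placeholders - the point is just that the environment, cluster and bucket are created once and then reused by the whole application):

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;

public final class CouchbaseHolder {

	// Created once when the class is first used, then shared by the whole application.
	private static final CouchbaseEnvironment ENV =
			DefaultCouchbaseEnvironment.builder().kvTimeout(15000).build();
	private static final Cluster CLUSTER =
			CouchbaseCluster.create(ENV, "cb-node1", "cb-node2");
	private static final Bucket BUCKET;

	static {
		CLUSTER.authenticate("app_user", "app_password");
		BUCKET = CLUSTER.openBucket("my_bucket");
	}

	private CouchbaseHolder() { }

	public static Bucket bucket() {
		return BUCKET;
	}

	// Call once on application shutdown.
	public static void shutdown() {
		CLUSTER.disconnect();
		ENV.shutdown();
	}
}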

Regarding the main problem, timeouts are always the consequence of some other issue. It could be an unstable network, a cluster not properly sized for your volume of data, etc. As a quick fix, you could increase the kvTimeout. The get operation uses the Couchbase key-value store internally (it does NOT use the query service, which makes the queryTimeout ineffective in this case, and I would suppose that the same is valid for connectTimeout).
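
So something like this should be enough, leaving the other timeouts at their defaults (the 15-second value is just an example - it is a trade-off between papering over slow operations and failing fast):

	// Only raise the KV timeout; connectTimeout/queryTimeout stay at their defaults.
	CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
			.kvTimeout(15000) // 15 s instead of the 2.5 s default
			.build();
	Cluster cluster = CouchbaseCluster.create(env, ConfigurationBean.get().getDatabaseServerNames());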

Well, it’s an application scope instance - so yes, there is only one at any time.

I have played around with the env. settings today - and, honestly, the defaults seem to behave more stably than my settings (!!)

The network is internal on a VMware ESXi server - using the IP addresses via a hosts file. In my demo environment it may suffer a little from low resources - however, there are only one or two users on it at any time, so I would expect it to be able to run. But I can try to increase the kvTimeout parameter, leave the others at their defaults, and see if that improves the situation.

Is there some timing (e.g. the connectTimeout?) that I should take into account in relation to failover?

Apart from the kvTimeout, I think you should be fine with the defaults. Make sure to set the number of replicas to 1 (as you only have 2 servers), so in case of a failure of one node you still have a copy of the data on the other node.

I increased the kvTimeout and I have a long-running task that reads and updates a few tens of thousands of docs - and so far it has not timed out :+1:

It seems relatively slow, though (CPU is high)… Could that be due to Sync Gateway wanting to import all the changes as well? If so, should I take the database offline in SG? Or could it be re-indexing? I just have two cluster nodes and they both run these services: data, index, query and search.

Oh, btw, where should I set replication to 1? I just tried to find that - it’s not in Sync Gateway, I guess… So I may have overlooked something?

Good to hear that at least you are not getting timeouts.

Regarding the slowness, there might be a variety of reasons: missing indexes, low memory residency, not enough memory for the OS, etc. We would need a little bit more information to guess what is happening.

The replication config is on the bucket level: https://docs.couchbase.com/server/6.5/manage/manage-buckets/create-bucket.html#couchbase-bucket-settings
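
If you ever want to verify it from code rather than the UI, the 2.x SDK exposes the bucket settings through the ClusterManager - a rough sketch, with admin credentials and bucket name as placeholders:

	// Needs: com.couchbase.client.java.cluster.ClusterManager,
	// com.couchbase.client.java.cluster.BucketSettings
	ClusterManager manager = cluster.clusterManager("Administrator", "password");
	BucketSettings settings = manager.getBucket("my_bucket");
	System.out.println("Bucket " + settings.name() + " has " + settings.replicas() + " replica(s)");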

Yep, it’s definitely an improvement :+1:

And I just had to click the bucket to expand it, and it is OK:

Replicas: 1
Server Nodes: 2

I appreciate that it’s difficult to troubleshoot the slowness - and I’m not sure it’s worth it, as that job is just re-organizing some data and I’m using the application logic (and DAOs) to do the re-org.

I was just curious as to where I could find/see any reasons for this. Still learning about Couchbase :innocent: :slight_smile:

Do you know if there are any other settings I should consider for the failover to work? Or should I change my code to actively discover and handle a failover? Any insights much appreciated :+1:

I suppose I could just try to take one of the servers down and see if it works now :wink: