Why does a data node failover cause a query timeout?

Why would a data node auto-failover, with the auto-failover timeout configured at 60 seconds, cause query timeouts for 2 minutes?

Is there a way to see which data nodes the query nodes routed the request to?

Server config:
12 query and 18 data nodes distributed among 3 server groups.

Hi - Can you post the exception? There could be RetryReasons in the exception depending on the SDK. Which SDK are you using? Version?

com.couchbase.client.core.error.AmbiguousTimeoutException: QueryRequest, 
Reason: TIMEOUT {"cancelled":true,"completed":true,"coreId":"0x53931e3400000001","idempotent":false,"lastDispatchedFrom":"192.168.85.25:59662","lastDispatchedTo":"10.115.218.157:18093","reason":"TIMEOUT",
"requestId":347799813,"requestType":"QueryRequest","retried":46,"retryReasons":["ENDPOINT_NOT_AVAILABLE"],"service":{"operationId":"331ead3f-ac4d-4881-b73e-7e5dee388bc0",
"statement":"-------------"},
"timeoutMs":20000,"timings":{"totalMicros":20005398}}
at com.couchbase.client.core.msg.BaseRequest.cancel(BaseRequest.java:184)
at com.couchbase.client.core.msg.Request.cancel(Request.java:70)
at com.couchbase.client.core.Timer.lambda$register$2(Timer.java:157)
at com.couchbase.client.core.deps.io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)
at com.couchbase.client.core.deps.io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
at com.couchbase.client.core.deps.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)
at com.couchbase.client.core.deps.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)
at com.couchbase.client.core.deps.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)
at com.couchbase.client.core.deps.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)

I omitted the SQL++ statement within the query, but pasted everything else.

Java SDK - 3.3.x

The exception shows that the query service endpoint was not available and the SDK tried 46 times. There may be information logged earlier as to why the SDK could not connect to a query service. If you can reproduce the behavior, DEBUG logging might provide more information.
Can you retry with the latest version of the SDK (3.7.2)? There have been improvements in handling rebalancing since 3.3.
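
For reference, here is a minimal sketch of issuing a SQL++ query with an explicit per-request timeout through the 3.x Java SDK; the connection string, credentials, and statement are placeholders, and the 20-second timeout simply matches the timeoutMs in the exception above:

import com.couchbase.client.core.error.AmbiguousTimeoutException;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.query.QueryOptions;
import com.couchbase.client.java.query.QueryResult;
import java.time.Duration;

public class QueryTimeoutSketch {
  public static void main(String[] args) {
    // Placeholder connection details
    Cluster cluster = Cluster.connect("couchbase://cb-host", "user", "password");
    try {
      QueryResult result = cluster.query(
          "SELECT ...",  // placeholder for the omitted SQL++ statement
          QueryOptions.queryOptions().timeout(Duration.ofSeconds(20)));
      result.rowsAsObject().forEach(System.out::println);
    } catch (AmbiguousTimeoutException e) {
      // The exception message already carries the request context
      // (retryReasons, lastDispatchedTo, timings, ...) as pasted above.
      System.err.println(e.getMessage());
    } finally {
      cluster.disconnect();
    }
  }
}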

Does it actually mean the query service was not available? Or that an underlying service wasn’t (the index service in the case of a covering index, or the data service in the case of a regular index)?

Because there was no indication of either the query or index service being down this time.

And there was no rebalancing either during this time.

Also, is there yet a way to read from a replica in case of a timeout and still make sure it isn’t a stale read? Is there anything the updated SDK provides for that?

That’s not possible, because the active node accepts changes to documents without telling the replicas that the document has been changed.

It means that the query service (which accepts client connections) was not available. Running with DEBUG logging would give more information about what happened leading up to that situation (i.e. specifics on what happened when the SDK attempted to connect). DEBUG logging will also log an event for every Retry.
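
If it helps, here is a minimal sketch of raising the SDK’s logger to DEBUG at runtime, assuming Logback is the SLF4J backend; with Log4j2 or another backend, set the com.couchbase logger to DEBUG in that backend’s configuration instead:

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class EnableCouchbaseDebugLogging {
  public static void main(String[] args) {
    // The Java SDK logs through SLF4J under the "com.couchbase" logger name.
    // Assumption: Logback is on the classpath as the SLF4J backend.
    Logger couchbase = (Logger) LoggerFactory.getLogger("com.couchbase");
    couchbase.setLevel(Level.DEBUG);
  }
}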

This is even when durability is set to MAJORITY, I presume?

Unfortunately we didn’t have DEBUG enabled on the SDK logs, so I suppose we’ll have to see how it behaves with DEBUG on the next time this happens.

Check query.log for errors (e.g., a timeout from a data node).

Using the KV API: if you are sure all the mutations and deletions are done with durability MAJORITY, then you can use “get all replicas” and have the application determine (1) whether at least a majority of the active+replica copies were returned, and (2) what value that majority holds. I don’t know what the query service would do.
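
For illustration, a minimal sketch of a write performed with durability MAJORITY via the Java SDK (bucket name, key, document content, and connection details are placeholders):

import com.couchbase.client.core.msg.kv.DurabilityLevel;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;
import com.couchbase.client.java.kv.UpsertOptions;

public class DurableWriteSketch {
  public static void main(String[] args) {
    // Placeholder connection details
    Cluster cluster = Cluster.connect("couchbase://cb-host", "user", "password");
    Collection collection = cluster.bucket("my-bucket").defaultCollection();

    // The upsert only succeeds once a majority of the copies hold the mutation
    // (with 2 replicas there are 3 copies, so a majority is 2 of them).
    collection.upsert("doc-key",
        JsonObject.create().put("status", "shipped"),
        UpsertOptions.upsertOptions().durability(DurabilityLevel.MAJORITY));

    cluster.disconnect();
  }
}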

You mean with a bucket with replicas set to 2, if getAllReplicas() returns 2 identical results, it would indicate the write succeeded on all the nodes, not just a majority, and we’re guaranteed to read the correct data?

@mreiche

GetAllReplicas attempts to get from the active as well as the replicas. Durability MAJORITY with two replicas guarantees that two of the three copies have been updated. Therefore, GetAllReplicas returning at least two identical results guarantees that that result is the latest.
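
A minimal sketch of that check with the Java SDK, following the description above; it assumes a bucket with two replicas, that every write used durability MAJORITY, and that the blocking getAllReplicas returns a stream of results covering the active and each replica (bucket name, key, and connection details are placeholders):

import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;
import com.couchbase.client.java.kv.GetReplicaResult;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MajorityReadSketch {
  // With 2 replicas there are 3 copies; a majority is 2.
  private static final int MAJORITY = 2;

  public static JsonObject readLatest(Collection collection, String key) {
    // getAllReplicas fetches from the active node and every replica.
    List<GetReplicaResult> copies =
        collection.getAllReplicas(key).collect(Collectors.toList());

    // Group the copies by CAS; identical CAS means an identical version of the document.
    Map<Long, List<GetReplicaResult>> byVersion =
        copies.stream().collect(Collectors.groupingBy(GetReplicaResult::cas));

    // If all writes used durability MAJORITY, the version held by at least two
    // of the three copies is the latest committed one (per the reply above).
    return byVersion.values().stream()
        .filter(group -> group.size() >= MAJORITY)
        .findFirst()
        .map(group -> group.get(0).contentAsObject())
        .orElseThrow(() -> new IllegalStateException(
            "No majority among the returned copies; fall back to a regular get or retry"));
  }

  public static void main(String[] args) {
    // Placeholder connection details
    Cluster cluster = Cluster.connect("couchbase://cb-host", "user", "password");
    Collection collection = cluster.bucket("my-bucket").defaultCollection();
    System.out.println(readLatest(collection, "doc-key"));
    cluster.disconnect();
  }
}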