Why does a data node failover cause a query timeout?

Why would a data node auto-failover, with the auto-failover timeout configured at 60 seconds, cause query timeouts for 2 minutes?

Is there a way to see which data nodes the query nodes routed the request to?

Server config:
12 query and 18 data nodes distributed among 3 server groups.

Hi - Can you post the exception? There could be RetryReasons in the exception depending on the SDK. Which SDK are you using? Version?

com.couchbase.client.core.error.AmbiguousTimeoutException: QueryRequest, 
Reason: TIMEOUT {"cancelled":true,"completed":true,"coreId":"0x53931e3400000001","idempotent":false,"lastDispatchedFrom":"192.168.85.25:59662","lastDispatchedTo":"10.115.218.157:18093","reason":"TIMEOUT",
"requestId":347799813,"requestType":"QueryRequest","retried":46,"retryReasons":["ENDPOINT_NOT_AVAILABLE"],"service":{"operationId":"331ead3f-ac4d-4881-b73e-7e5dee388bc0",
"statement":"-------------"},
"timeoutMs":20000,"timings":{"totalMicros":20005398}}
at com.couchbase.client.core.msg.BaseRequest.cancel(BaseRequest.java:184)
at com.couchbase.client.core.msg.Request.cancel(Request.java:70)
at com.couchbase.client.core.Timer.lambda$register$2(Timer.java:157)
at com.couchbase.client.core.deps.io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)
at com.couchbase.client.core.deps.io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
at com.couchbase.client.core.deps.io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)
at com.couchbase.client.core.deps.io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)
at com.couchbase.client.core.deps.io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)
at com.couchbase.client.core.deps.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)

I omitted the SQL++ statement within the query, but pasted everything else.

Java SDK - 3.3.x

The exception shows that the query service endpoint was not available and the SDK tried 46 times. There may be information logged earlier as to why the SDK could not connect to a query service. If you can reproduce the behavior, DEBUG logging might provide more information.
Can you retry with the latest version of the SDK (3.7.2)? There have been improvements in handling rebalancing since 3.3.
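
For reference, here is a minimal sketch of issuing a SQL++ query with an explicit per-request timeout through the 3.x Java SDK; the connection string, credentials, and statement are placeholders, and the 20-second timeout simply matches the timeoutMs in the exception above:

import com.couchbase.client.core.error.AmbiguousTimeoutException;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.query.QueryOptions;
import com.couchbase.client.java.query.QueryResult;
import java.time.Duration;

public class QueryTimeoutSketch {
  public static void main(String[] args) {
    // Placeholder connection details
    Cluster cluster = Cluster.connect("couchbase://cb-host", "user", "password");
    try {
      QueryResult result = cluster.query(
          "SELECT ...",  // placeholder for the omitted SQL++ statement
          QueryOptions.queryOptions().timeout(Duration.ofSeconds(20)));
      result.rowsAsObject().forEach(System.out::println);
    } catch (AmbiguousTimeoutException e) {
      // The exception message already carries the request context
      // (retryReasons, lastDispatchedTo, timings, ...) as pasted above.
      System.err.println(e.getMessage());
    } finally {
      cluster.disconnect();
    }
  }
}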

Does it actually mean the query service was not available? Or that an underlying service wasn’t (the index service in the case of a covering index, or the data service in the case of a regular index)?

Because there was no indication of either the query or index service being down this time.

And there was no rebalancing either during this time.

Also, is there yet a way to read from a replica in case of a timeout and still make sure it isn’t a stale read? Is there anything the updated SDK provides for that?

That’s not possible, because the active node accepts changes to documents without telling the replicas that the document has been changed.

It means that the query service (which accepts client connections) was not available. Running with DEBUG logging would give more information about what happened leading up to that situation (i.e. specifics on what happened when the SDK attempted to connect). DEBUG logging will also log an event for every Retry.
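
If it helps, here is a minimal sketch of raising the SDK’s logger to DEBUG at runtime, assuming Logback is the SLF4J backend; with Log4j2 or another backend, set the com.couchbase logger to DEBUG in that backend’s configuration instead:

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

public class EnableCouchbaseDebugLogging {
  public static void main(String[] args) {
    // The Java SDK logs through SLF4J under the "com.couchbase" logger name.
    // Assumption: Logback is on the classpath as the SLF4J backend.
    Logger couchbase = (Logger) LoggerFactory.getLogger("com.couchbase");
    couchbase.setLevel(Level.DEBUG);
  }
}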

This is even when durability is set to MAJORITY, I presume?

Unfortunately we didn’t have DEBUG enabled on the SDK logs, so I suppose we’ll have to see how it behaves with DEBUG on the next time this happens.

Check query.log for errors (e.g., a timeout from a data node).

Using the KV API: if you are sure all the mutations and deletions are done with durability MAJORITY, then you can use “get all replicas” and have the application determine (1) whether at least a majority of the active+replica copies were returned, and (2) what value that majority holds. I don’t know what the query service would do.
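
For illustration, a minimal sketch of a write performed with durability MAJORITY via the Java SDK (bucket name, key, document content, and connection details are placeholders):

import com.couchbase.client.core.msg.kv.DurabilityLevel;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;
import com.couchbase.client.java.kv.UpsertOptions;

public class DurableWriteSketch {
  public static void main(String[] args) {
    // Placeholder connection details
    Cluster cluster = Cluster.connect("couchbase://cb-host", "user", "password");
    Collection collection = cluster.bucket("my-bucket").defaultCollection();

    // The upsert only succeeds once a majority of the copies hold the mutation
    // (with 2 replicas there are 3 copies, so a majority is 2 of them).
    collection.upsert("doc-key",
        JsonObject.create().put("status", "shipped"),
        UpsertOptions.upsertOptions().durability(DurabilityLevel.MAJORITY));

    cluster.disconnect();
  }
}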

You mean with a bucket with replicas set to 2, if getAllReplicas() returns 2 identical results, it would indicate the write succeeded on all the nodes, not just a majority, and we’re guaranteed to read the correct data?

@mreiche

GetAllReplicas attempts to get from the active as well as the replicas. Durability MAJORITY with two replicas guarantees that two of the three copies have been updated. Therefore, GetAllReplicas returning at least two identical results guarantees that that result is the latest.
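
A minimal sketch of that check with the Java SDK, following the description above; it assumes a bucket with two replicas, that every write used durability MAJORITY, and that the blocking getAllReplicas returns a stream of results covering the active and each replica (bucket name, key, and connection details are placeholders):

import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonObject;
import com.couchbase.client.java.kv.GetReplicaResult;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MajorityReadSketch {
  // With 2 replicas there are 3 copies; a majority is 2.
  private static final int MAJORITY = 2;

  public static JsonObject readLatest(Collection collection, String key) {
    // getAllReplicas fetches from the active node and every replica.
    List<GetReplicaResult> copies =
        collection.getAllReplicas(key).collect(Collectors.toList());

    // Group the copies by CAS; identical CAS means an identical version of the document.
    Map<Long, List<GetReplicaResult>> byVersion =
        copies.stream().collect(Collectors.groupingBy(GetReplicaResult::cas));

    // If all writes used durability MAJORITY, the version held by at least two
    // of the three copies is the latest committed one (per the reply above).
    return byVersion.values().stream()
        .filter(group -> group.size() >= MAJORITY)
        .findFirst()
        .map(group -> group.get(0).contentAsObject())
        .orElseThrow(() -> new IllegalStateException(
            "No majority among the returned copies; fall back to a regular get or retry"));
  }

  public static void main(String[] args) {
    // Placeholder connection details
    Cluster cluster = Cluster.connect("couchbase://cb-host", "user", "password");
    Collection collection = cluster.bucket("my-bucket").defaultCollection();
    System.out.println(readLatest(collection, "doc-key"));
    cluster.disconnect();
  }
}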