Fail fast strategy, RequestCancelledException and couchbase 6.6

We are using latest java sdk 2.7.20 and we recently migrated to couchbase 6.6. We introduced a health check to check the cluster health every few seconds. We are seeing intermittent RequestCancelledException when using reactive n1ql using the fail fast strategy.

It takes 1 second before the error is propagated back to client and it is going to break our SLA to respond within 100-200 ms in case of n1ql to retry.

Stack trace

throwable_class=com.couchbase.client.core.RequestCancelledException" “throwable_message=Could not dispatch request, cancelling instead of retrying.” "stacktrace_elements=com.couchbase.client.core.retry.RetryHelper.retryOrCancel(RetryHelper.java:51),com.couchbase.client.core.service.PooledService.unsubscribeAndRetry(PooledService.java:420),com.couchbase.client.core.service.PooledService.access$1200(PooledService.java:50),com.couchbase.client.core.service.PooledService$9.onError(PooledService.java:401),rx.observers.SafeSubscriber._onError(SafeSubscriber.java:153),rx.observers.SafeSubscriber.onError(SafeSubscriber.java:115),rx.subjects.SubjectSubscriptionManager$SubjectObserver.onError(SubjectSubscriptionManager.java:227),rx.subjects.AsyncSubject.onError(AsyncSubject.java:116),com.couchbase.client.core.endpoint.AbstractEndpoint$2.onSuccess(AbstractEndpoint.java:465),com.couchbase.client.core.endpoint.AbstractEndpoint$2.onSuccess(AbstractEndpoint.java:422),rx.internal.operators.SingleOperatorOnErrorResumeNext$2.onSuccess(SingleOperatorOnErrorResumeNext.java:63),rx.internal.operators.SingleTimeout$TimeoutSingleSubscriber.onSuccess(SingleTimeout.java:79),com.couchbase.client.core.endpoint.AbstractEndpoint$4$1.operationComplete(AbstractEndpoint.java:399),com.couchbase.client.core.endpoint.AbstractEndpoint$4$1.operationComplete(AbstractEndpoint.java:383),com.couchbase.client.deps.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578),com.couchbase.client.deps.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571),com.couchbase.client.deps.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550),com.couchbase.client.deps.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491),com.couchbase.client.deps.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616),com.couchbase.client.deps.io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609),com.couchbase.client.deps.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117),com.couchbase.client.deps.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:262),com.couchbase.client.deps.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98),com.couchbase.client.deps.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170),com.couchbase.client.deps.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164),com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472),com.couchbase.client.deps.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500),com.couchbase.client.deps.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989),com.couchbase.client.deps.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74),com.couchbase.client.deps.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30),java.lang.Thread.run(Thread.java:748)

Sample code

Observable
            .just(System.currentTimeMillis())
            .flatMap(t1 ->
                    primaryAsyncBucket
                            .query(N1qlQuery.simple("select ~BUCKETNAME~.* from ~BUCKETNAME~ USE KEYS ['HEALTHCHECK'] LIMIT 1".replace("~BUCKETNAME~", primaryAsyncBucket.name())))
                            .doOnError(t -> {
                              if(t instanceof RequestCancelledException) {
                                log.error(new LogEvent("performHealthCheck", "healthCheck - primaryN1QLCheck").addEntry("time_taken", System.currentTimeMillis() - t1).toString());
                              }
                            })
                            .flatMap(result -> {
                                      return result.errors()
                                              .flatMap(e -> {
                                                return Observable.<AsyncN1qlQueryRow>error(new CouchbaseException("N1QL Error/Warning: " + e));
                                              })
                                              .switchIfEmpty(result.rows());
                                    }

                            )
                            .map(AsyncN1qlQueryRow::value)

            )
            .firstOrDefault(null)
            .map(result -> result != null && result.getString("key_tx") != null)
            .onErrorResumeNext(t -> {
              log.error(new LogEvent("performHealthCheck", "healthCheck - primaryN1QLCheck").addThrowable(t).toString());
              return Observable.just(false);
            })

Any suggestions on how to fix this issue?

@himanshu.mps it looks like the socket connect attempt was not successful. Can you check the logs in the same timeframe on why the query socket could not be established?

@daschl Attached are the logs from our server. Hope this gives a clue. If you need more logs, let us know the settings that you want us to put in to debug the issue.

DefaultCouchbaseEnvironment.Builder builder = DefaultCouchbaseEnvironment
              .builder()
              .autoreleaseAfter(Duration.ofMinutes(5).toMillis())
              .connectTimeout(Duration.ofSeconds(30).toMillis())
              .dnsSrvEnabled(false)
              .retryStrategy(FailFastRetryStrategy.INSTANCE);
      couchbaseEnvironment = builder.build();
2021-08-25T13:52:31.13430452Z app_name=xxxxxxxxx ns=zzzz-api-v4 pod_container=xxxxxxxxx pod_name=xxxxxxxxx-deployment-204-zlp9k stream=stdout message=06:52:31.134 [cb-core-4-2] ERROR com.yyyy.zzzz.zzzz_common_module.couchbase.HealthCheckCouchbase - "name=performHealthCheck" "event_id=healthCheck - failoverN1QLCheck" "throwable_class=com.couchbase.client.core.RequestCancelledException" "throwable_message=Could not dispatch request, cancelling instead of retrying." "stacktrace_elements=com.couchbase.client.core.retry.RetryHelper.retryOrCancel(RetryHelper.java:51),com.couchbase.client.core.service.PooledService.send(PooledService.java:353),com.couchbase.client.core.node.CouchbaseNode.send(CouchbaseNode.java:215),com.couchbase.client.core.node.locate.QueryLocator.locateAndDispatch(QueryLocator.java:100),com.couchbase.client.core.RequestHandler.dispatchRequest(RequestHandler.java:258),com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:207),com.couchbase.client.core.RequestHandler.onEvent(RequestHandler.java:78),com.couchbase.client.deps.com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:168),com.couchbase.client.deps.com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125),com.couchbase.client.deps.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30),java.lang.Thread.run(Thread.java:748)"
2021-08-25T13:52:27.3859271Z app_name=xxxxxxxxx ns=zzzz-api-v4 pod_container=xxxxxxxxx pod_name=xxxxxxxxx-deployment-204-zlp9k stream=stdout message=06:52:27.385 [cb-core-4-2] WARN  com.couchbase.client.core.utils.HealthPinger - Error while running PingService for QUERY
2021-08-25T13:52:27.377926682Z app_name=xxxxxxxxx ns=zzzz-api-v4 pod_container=xxxxxxxxx pod_name=xxxxxxxxx-deployment-204-zlp9k stream=stdout message=com.couchbase.client.core.RequestCancelledException: Could not dispatch request, cancelling instead of retrying.
2021-08-25T13:52:27.377841829Z app_name=xxxxxxxxx ns=zzzz-api-v4 pod_container=xxxxxxxxx pod_name=xxxxxxxxx-deployment-204-zlp9k stream=stdout message=06:52:27.377 [cb-core-3-2] WARN  com.couchbase.client.core.utils.HealthPinger - Error while running PingService for QUERY
2021-08-25T13:52:22.578710837Z app_name=xxxxxxxxx ns=zzzz-api-v4 pod_container=xxxxxxxxx pod_name=xxxxxxxxx-deployment-204-zlp9k stream=stdout message=06:52:22.578 [cb-io-1-3] INFO  com.couchbase.client.core.endpoint.Endpoint - [lpdospdbb50430.www.rrrr.com/10.9.132.71:8093][QueryEndpoint]: Got notified from Channel as inactive, attempting reconnect.
2021-08-25T13:52:22.577116725Z app_name=xxxxxxxxx ns=zzzz-api-v4 pod_container=xxxxxxxxx pod_name=xxxxxxxxx-deployment-204-zlp9k stream=stdout message=06:52:22.576 [cb-io-1-2] INFO  com.couchbase.client.core.endpoint.Endpoint - [lpdospdbb50430.www.rrrr.com/10.9.132.71:8093][QueryEndpoint]: Got notified from Channel as inactive, attempting reconnect.
2021-08-25T13:52:22.575183076Z app_name=xxxxxxxxx ns=zzzz-api-v4 pod_container=xxxxxxxxx pod_name=xxxxxxxxx-deployment-204-zlp9k stream=stdout message=06:52:22.574 [cb-io-1-1] INFO  com.couchbase.client.core.endpoint.Endpoint - [lpdospdbb50430.www.rrrr.com/10.9.132.71:8093][QueryEndpoint]: Got notified from Channel as inactive, attempting reconnect.
2021-08-25T13:52:22.455226727Z app_name=xxxxxxxxx ns=zzzz-api-v4 pod_container=xxxxxxxxx pod_name=xxxxxxxxx-deployment-204-zlp9k stream=stdout message=06:52:22.454 [cb-io-1-1] INFO  com.couchbase.client.core.endpoint.Endpoint - [lpdospdbb50431.www.rrrr.com/10.9.132.196:8093][QueryEndpoint]: Got notified from Channel as inactive, attempting reconnect.
2021-08-25T13:52:22.45520112Z app_name=xxxxxxxxx ns=zzzz-api-v4 pod_container=xxxxxxxxx pod_name=xxxxxxxxx-deployment-204-zlp9k stream=stdout message=06:52:22.454 [cb-io-1-3] INFO  com.couchbase.client.core.endpoint.Endpoint - [lpdospdbb50431.www.rrrr.com/10.9.132.196:8093][QueryEndpoint]: Got notified from Channel as inactive, attempting reconnect.
2021-08-25T13:52:22.455182783Z app_name=xxxxxxxxx ns=zzzz-api-v4 pod_container=xxxxxxxxx pod_name=xxxxxxxxx-deployment-204-zlp9k stream=stdout message=06:52:22.454 [cb-io-1-2] INFO  com.couchbase.client.core.endpoint.Endpoint - [lpdospdbb50431.www.rrrr.com/10.9.132.196:8093][QueryEndpoint]: Got notified from Channel as inactive, attempting reconnect.
2021-08-25T13:52:22.435949747Z app_name=xxxxxxxxx ns=zzzz-api-v4 pod_container=xxxxxxxxx pod_name=xxxxxxxxx-deployment-204-zlp9k stream=stdout message=06:52:22.435 [cb-io-1-1] INFO  com.couchbase.client.core.endpoint.Endpoint - [lpdospdbb50432.www.rrrr.com/10.9.132.180:8093][QueryEndpoint]: Got notified from Channel as inactive, attempting reconnect.
2021-08-25T13:52:22.408886242Z app_name=xxxxxxxxx ns=zzzz-api-v4 pod_container=xxxxxxxxx pod_name=xxxxxxxxx-deployment-204-zlp9k stream=stdout message=06:52:22.408 [cb-io-1-3] INFO  com.couchbase.client.core.endpoint.Endpoint - [lpdospdbb50432.www.rrrr.com/10.9.132.180:8093][QueryEndpoint]: Got notified from Channel as inactive, attempting reconnect.
2021-08-25T13:52:22.407566252Z app_name=xxxxxxxxx ns=zzzz-api-v4 pod_container=xxxxxxxxx pod_name=xxxxxxxxx-deployment-204-zlp9k stream=stdout message=06:52:22.407 [cb-io-1-2] INFO  com.couchbase.client.core.endpoint.Endpoint - [lpdospdbb50432.www.rrrr.com/10.9.132.180:8093][QueryEndpoint]: Got notified from Channel as inactive, attempting reconnect.