Due to a possible bug in Couchbase 7.0.1 CE and EE [1], we experience frequent crashes of the indexing service. After the indexer and query services crash, the Java SDK reports the following error:
```
Internal Couchbase Server error
{
  "completed":true,
  "coreId":"0x3dc04a9600000002",
  "errors":[
    {
      "code":5000,
      "message":" dial tcp 10.233.82.91:9101: connect: connection refused from [seven-couchbase-2.seven-couchbase.development.svc.sigma:9101] - cause: dial tcp 10.233.82.91:9101: connect: connection refused from [seven-couchbase-2.seven-couchbase.development.svc.sigma:9101]"
    }
  ],
  "idempotent":true,
  "lastDispatchedFrom":"10.240.0.10:41586",
  "lastDispatchedTo":"seven-couchbase-2.seven-couchbase.development.svc.sigma:8093",
  "requestId":202,
  "requestType":"QueryRequest",
  "retried":0,
  "service":{
    "operationId":"960c557d-95f8-4ccf-919d-889440cf8857",
    "statement":"SELECT b.* FROM `bucket`.`default`.`sescus-A_e` b WHERE clientID = $byField",
    "type":"query"
  },
  "timeoutMs":40000,
  "timings":{
    "dispatchMicros":7488504,
    "totalDispatchMicros":7488504
  }
}
```
The query is executed as follows:
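```scala
val f = $.cluster.query(
  s"SELECT b.* FROM ${$.scalaEntityCollection.from} b " +
    // Plain (non-interpolated) string, so `$byField` reaches the server as a named parameter.
    "WHERE " + name + " = $byField",
  QueryOptions()
    .readonly(true)
    .parameters(
      Named(
        "byField" -> value
      )
    )
)
```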
We mark the `SELECT` query as idempotent (the query is sent with `readonly(true)`, and the error context above reports `"idempotent":true`). However, the SDK does not attempt any retries (`"retried":0`), possibly because this type of error is not eligible for retry by the SDK. Therefore, we retry outside of the SDK, as the SDK documentation suggests. Below is an experimental implementation of a blocking retry mechanism:
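```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Success}

protected[storage] def withComplementaryRetry[R](
  f: => R
)(implicit couchbaseConfiguration: Couchbase.Configuration): R = {
  // Every completed result counts as a success; failed futures are retried.
  // (The former `_.isInstanceOf[R]` check is unchecked because of type erasure.)
  val querySuccess = new retry.Success[R](_ => true)
  Await.result(
    retry.JitterBackoff(
      Int.MaxValue, // effectively unbounded; `Await.result` below bounds how long the caller waits
      couchbaseConfiguration.complementaryRetryStrategy.baseDelay
    )(odelay.Timer.default, retry.Jitter.full(cap = 3.seconds)) {
      // `f` is by-name, so every attempt re-executes the query.
      val future = Future(f)(Couchbase.complementaryRetryingThreadPool)
      future.onComplete {
        case Failure(exception) =>
          log.warn(
            "Could not complete query successfully due to error [{}] with message [{}]!",
            exception.getClass.getName,
            exception.getMessage
          )
        case Success(_) => ()
      }(Couchbase.complementaryRetryingThreadPool)
      future
    }(querySuccess, Couchbase.complementaryRetryingThreadPool),
    couchbaseConfiguration.complementaryRetryStrategy.timeout
  )
}
```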
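For comparison, here is a dependency-free sketch of the same blocking retry idea. It assumes that sleeping on the calling thread is acceptable. `BlockingJitterRetry`, `fullJitterDelay` and `retryBlocking` are hypothetical names and are not part of the `retry` library or the Couchbase SDK. The delay uses the "full jitter" formula: a uniform random delay between zero and `min(cap, base * 2^attempt)`. This is presumably what `retry.Jitter.full(cap = 3.seconds)` approximates. One difference from the code above is that retrying stops once the deadline passes. With `Int.MaxValue` attempts, by contrast, the `retry` future may keep running in the background after `Await.result` has timed out.

```scala
import scala.annotation.tailrec
import scala.concurrent.duration._
import scala.util.{Failure, Random, Success, Try}

// Hypothetical helper; not part of the `retry` library or the Couchbase SDK.
object BlockingJitterRetry {

  // "Full jitter": a uniform random delay in [0, min(cap, base * 2^attempt)).
  def fullJitterDelay(base: FiniteDuration, cap: FiniteDuration, attempt: Int, rnd: Random): FiniteDuration = {
    val exponential = base.toMillis * (1L << math.min(attempt, 30)) // clamp the shift to avoid overflow
    val upper = math.min(cap.toMillis, exponential)
    (rnd.nextDouble() * upper).toLong.millis
  }

  // Re-evaluates the by-name `f` until it succeeds or `timeout` elapses;
  // once the deadline has passed, the last failure is rethrown.
  def retryBlocking[R](
      base: FiniteDuration,
      cap: FiniteDuration,
      timeout: FiniteDuration,
      rnd: Random = new Random(),
      sleep: FiniteDuration => Unit = d => Thread.sleep(d.toMillis)
  )(f: => R): R = {
    val deadline = timeout.fromNow

    @tailrec
    def loop(attempt: Int): R = Try(f) match {
      case Success(result) => result
      case Failure(error) if deadline.isOverdue() => throw error
      case Failure(_) =>
        // Never sleep past the deadline.
        sleep(fullJitterDelay(base, cap, attempt, rnd) min deadline.timeLeft.max(Duration.Zero))
        loop(attempt + 1)
    }

    loop(0)
  }
}
```

The Scala SDK's `cluster.query` returns a `Try[QueryResult]`, and `Try(f)` above only retries when `f` throws. A wrapped call would therefore need to call `.get` on the query result, so that a failed `Try` becomes an exception and triggers another attempt.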
When the above `Internal Couchbase Server error` is observed, `withComplementaryRetry` retries the query (because the SDK does not). However, even though Couchbase recovers quickly from the indexer crash, the SDK keeps failing with `Internal Couchbase Server error`, even after 300 seconds. While those 300 seconds tick down and `withComplementaryRetry` retries again and again, a new SDK instance constructed in the meantime can complete queries successfully. Therefore, I think the existing SDK instance cannot recover from these `Internal Couchbase Server error`s.
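A freshly constructed SDK instance succeeds, which suggests a stop-gap: replace the instance after several consecutive failures of this kind. The sketch below is generic over the client type, so it runs without Couchbase. `RecreatingClient`, `threshold` and `shouldRecreate` are hypothetical names. With the Scala SDK, `create` might be `() => Cluster.connect(connectionString, username, password).get`, `close` might be `_.disconnect()`, and `shouldRecreate` might match `com.couchbase.client.core.error.InternalServerFailureException`. It is an untested assumption that recreating the instance clears the condition described above.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical wrapper: after `threshold` consecutive failures accepted by
// `shouldRecreate`, the current client is closed and replaced with a new one.
final class RecreatingClient[C <: AnyRef](
    create: () => C,
    close: C => Unit,
    threshold: Int,
    shouldRecreate: Throwable => Boolean
) {
  private var client: C = create()
  private var consecutiveFailures = 0

  def current: C = synchronized(client)

  def run[R](op: C => R): Try[R] = {
    val used = current
    val result = Try(op(used))
    synchronized {
      result match {
        case Success(_) =>
          consecutiveFailures = 0
        case Failure(error) if shouldRecreate(error) =>
          consecutiveFailures += 1
          // Only replace the instance this call actually used, so concurrent
          // callers that failed on the same stale client do not recreate it twice.
          if (consecutiveFailures >= threshold && (client eq used)) {
            Try(close(client)) // best effort; a failing close must not block recovery
            client = create()
            consecutiveFailures = 0
          }
        case Failure(_) =>
          () // other errors neither count towards nor reset the threshold
      }
    }
    result
  }
}
```

Closing the old instance while other threads still use it will fail their in-flight requests. A real implementation would probably need to delay the close until those requests have drained.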
This question matters because Couchbase 7.0.1 CE and EE have an issue with the indexing service [1], so protecting against these failures is now necessary.
Is there a way to programmatically force the SDK to recover from these errors for idempotent queries?