Get from replica not working as expected

Hi All,

Hoping to get some help with an issue we’re seeing when getting documents from a replica after a node goes down. Summarized below.

Thanks!

Setup:

  • 5 node cluster
  • Number of replicas set to 3 on each bucket
  • Auto-failover set to 2 minutes
  • Confirmed auto-failover works when taking single node out of the cluster
  • Confirmed data is replicated by taking a node out and rebalancing on remaining nodes

Assumptions:

  • Our assumption is that we can keep functioning by staying connected to the 4 nodes that are still live and pulling documents from a replica when needed (first attempt to get the document, then on failure recover by getting it from a replica)
  • What we’re seeing is that only 4/5 of the data is available after 1 node goes down, despite using the getFromReplica methods provided in the Java client

What we’re trying to achieve:

  • Application can handle a single node going down without being affected
    • Connection attempt failures to the downed node are fine, but ideally it would recover from failed document lookups by getting from the replica on a live node
    • Eventually the node would auto-failover
  • In the event two nodes go down and auto-failover does not occur, we could still run with the remaining 3 nodes by getting existing data from the replicas until someone intervenes to manually fail over and rebalance

Snippet of code we’re using to get from replica

asyncBucket.get(id, classOf[RawJsonDocument])
  .onErrorResumeNext(async.getFromReplica(id, ReplicaMode.ALL, classOf[RawJsonDocument]))
  .singleOption

@ingenthr @vsr1 @daschl Can any of you help with this?

I think your assumptions are valid @dgrizzanti and what you’re trying to achieve is reasonable. I don’t see a description of how you’re triggering the failure or what behavior you’re seeing though.

I’ll defer to @daschl, but it could be related to the .onErrorResumeNext() in that it depends on how things are failing. In the case of a node that falls off the network, the TCP connection is still half open, so the failure mode would be a TimeoutException for a while. The problem may be that the default timeout for the overall get is that same timeout value, leaving no time for the replica fallback to run?
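For example (just a sketch with made-up numbers, reusing the identifiers from your snippet; the 1500 ms value is an assumption, not a recommendation), you could give the initial get its own, shorter Rx timeout so a replica read still has budget to complete:

import java.util.concurrent.TimeUnit

asyncBucket.get(id, classOf[RawJsonDocument])
  .timeout(1500, TimeUnit.MILLISECONDS)  // fail the primary read early...
  .onErrorResumeNext(
    async.getFromReplica(id, ReplicaMode.ALL, classOf[RawJsonDocument])
      .take(1))                          // ...so a replica read can still finish; take whichever copy answers first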

So, you may want to revisit how you’re creating the failure (the best approach is to either down the network interface or have a firewall drop packets), and make sure that’s triggering the error you expect before chaining in the what-to-do-next.

Do note that in addition to TimeoutException, you can also see a CancellationException. The difference is that on timeout, the SDK is indicating it doesn’t know what has actually happened with the operation, while on cancellation, it’s telling you that it wasn’t sent to the network.
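In other words, you may want the fallback to look at what actually went wrong before reading a replica, roughly along these lines (a sketch only, reusing the names from your snippet; exact exception classes vary by SDK version, and java.util.concurrent.TimeoutException here is what an explicit Rx .timeout() emits):

import java.util.concurrent.TimeoutException
import rx.Observable
import rx.functions.Func1

val fallback = new Func1[Throwable, Observable[RawJsonDocument]] {
  override def call(t: Throwable): Observable[RawJsonDocument] = t match {
    case _: TimeoutException =>
      // outcome of the original op is unknown, so reading a replica is reasonable
      async.getFromReplica(id, ReplicaMode.ALL, classOf[RawJsonDocument]).take(1)
    case other =>
      // e.g. the cancellation case: the op never made it onto the network,
      // so decide explicitly whether to retry, fall back, or surface the error
      Observable.error[RawJsonDocument](other)
  }
}

asyncBucket.get(id, classOf[RawJsonDocument]).onErrorResumeNext(fallback)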

@ingenthr thanks for getting back to me. I should have included the failure scenario we tried in the original description, so I’ll describe it now.

In order to test a scenario where a node failure occurs, we did the following:

  • Started with 3 active nodes, with each bucket’s replica count set to 3
  • Auto-failover turned off
  • Created 1k documents while all 3 nodes were active
  • While no processes were accessing this data, we shut down the Couchbase process on node 1
  • Ran a script that uses getFromReplica to try to retrieve all 1k documents from the remaining 2 nodes (a simplified sketch of it is below)
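Roughly, the retrieval step does something like this (a simplified sketch, not our exact script; asyncBucket is the same handle as in the snippet from my first post, and the document-ID pattern is a placeholder):

import com.couchbase.client.java.ReplicaMode
import com.couchbase.client.java.document.RawJsonDocument

val ids = (1 to 1000).map(i => s"doc-$i")  // the 1k documents created earlier

val found = ids.count { id =>
  val doc = scala.util.Try {
    asyncBucket.get(id, classOf[RawJsonDocument])
      .onErrorResumeNext(
        asyncBucket.getFromReplica(id, ReplicaMode.ALL, classOf[RawJsonDocument]).take(1))
      .toBlocking()
      .singleOrDefault(null)  // null when neither the active copy nor a replica returned the doc
  }
  doc.toOption.exists(_ != null)
}

println(s"retrieved $found of ${ids.size} documents")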

In the scenario above, when 1 node was down we would consistently get only 2/3 of the documents back. This is not our normal use case, but we wanted something as straightforward as possible to exercise the getFromReplica option.

Thanks