Hoping to get some help with an issue we’re seeing when getting documents from a replica after a node goes down. Summarized below.
Thanks!
Setup:
5 node cluster
Number of replicas set to 3 on each bucket
Auto-failover set to 2 minutes
Confirmed auto-failover works when taking single node out of the cluster
Confirmed data is replicated by taking a node out and rebalancing on remaining nodes
Assumptions:
Our assumption is that we can still function by staying connected to the 4 live nodes and pulling documents from a replica if needed (by first attempting a normal get and, if that fails, recovering by getting the document from a replica)
What we’re seeing is that only 4/5 of the data is available after 1 node goes down, despite using the getFromReplica methods provided in the Java client
What we’re trying to achieve:
Application can handle a single node going down without being affected
Connection attempt failures to the downed node are fine, but ideally the application would recover from failed document lookups by getting the document from a replica on a live node
Eventually the node would auto-failover
In the event two nodes go down and auto-failover does not occur, we could still run on the remaining 3 nodes by getting existing data from the replicas until someone intervenes to manually fail over and rebalance
I think your assumptions are valid @dgrizzanti and what you’re trying to achieve is reasonable. I don’t see a description of how you’re triggering the failure or what behavior you’re seeing though.
I’ll defer to @daschl, but it could be related to the .onErrorResumeNext(), in that it depends on how things are failing. In the case of a fall-off-the-network down node, the TCP connection is still half open, so the failure mode would be TimeoutException for a while. The problem may be that the timeout on your overall get is the same value, so the whole call expires before the fallback gets a chance to run.
So, you may want to revisit how you’re creating the failure (the best approach is to either down the network interface or have a firewall drop packets) and make sure that’s triggering the error you expect before chaining in the what-to-do-next.
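To illustrate the timeout concern, here is a minimal sketch using only JDK classes, not the Couchbase SDK. The names getWithReplicaFallback, activeRead and replicaRead are hypothetical stand-ins for a normal get and a replica read. It assumes the fallback is only useful if the active read is given a shorter timeout than the overall budget.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ReplicaFallbackSketch {

    // Hypothetical helper: try the active copy first, then a replica,
    // all within one overall time budget. Only a timeout on the active
    // read triggers the fallback here; other failures propagate as-is.
    static String getWithReplicaFallback(Callable<String> activeRead,
                                         Callable<String> replicaRead,
                                         long activeTimeoutMs,
                                         long overallTimeoutMs) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(overallTimeoutMs);
        try {
            Future<String> active = pool.submit(activeRead);
            try {
                return active.get(activeTimeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                active.cancel(true); // give up on the unresponsive node
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                // The active read consumed the whole budget, so the
                // replica read never gets a chance to run.
                throw new TimeoutException("no time left for the replica read");
            }
            return pool.submit(replicaRead).get(remainingNanos, TimeUnit.NANOSECONDS);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulates a half-open connection: the active read just hangs.
        Callable<String> hungActive = () -> { Thread.sleep(10_000); return "from-active"; };
        Callable<String> replica = () -> "from-replica";

        // Active timeout shorter than the overall budget: fallback runs.
        System.out.println(getWithReplicaFallback(hungActive, replica, 200, 1000));

        // Active timeout equal to the overall budget: fallback is starved.
        try {
            getWithReplicaFallback(hungActive, replica, 500, 500);
        } catch (TimeoutException e) {
            System.out.println("timed out: " + e.getMessage());
        }
    }
}
```

With the real client the same idea would presumably mean putting a shorter timeout on the first get than on the chain as a whole, but check that against the SDK version you are running.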
Do note that in addition to TimeoutException, you can also see a CancellationException. The difference is that on timeout, the SDK is indicating it doesn’t know what has actually happened with the operation, while on cancellation, it’s telling you that it wasn’t sent to the network.
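A small sketch of telling those two failure modes apart before deciding what to do next. It uses the JDK's java.util.concurrent.TimeoutException and CancellationException as stand-ins; the exact exception types the SDK surfaces may differ by version, and the Outcome enum and classify helper are hypothetical names. The cause chain is walked because blocking calls often wrap the original error.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.TimeoutException;

public class FailureModeSketch {

    // Hypothetical labels for the two situations described above.
    enum Outcome { OUTCOME_UNKNOWN, NEVER_SENT, OTHER }

    // Walks the cause chain and reports which failure mode applies.
    static Outcome classify(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof TimeoutException) {
                return Outcome.OUTCOME_UNKNOWN; // may or may not have reached the server
            }
            if (t instanceof CancellationException) {
                return Outcome.NEVER_SENT; // was not written to the network
            }
        }
        return Outcome.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify(new RuntimeException(new TimeoutException())));
        System.out.println(classify(new CancellationException()));
        System.out.println(classify(new IllegalStateException("unrelated")));
    }
}
```

For a plain read, both cases should be safe to retry against a replica, since a get does not change data. The distinction matters more for writes, where a timed-out operation may already have been applied.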
@ingenthr thanks for getting back to me. I should have given the sample failure scenario we tried in the original description, but will try to describe that now.
In order to test a scenario where a node failure occurs, we did the following:
Started with 3 active nodes, each bucket’s replica count set to 3
Auto Failover is turned off
Created 1k documents while all 3 nodes were active
While no processes were running that tried to access this data, we shut down the Couchbase process on node 1
Ran a script that uses getFromReplica to try to retrieve all 1k documents from the remaining 2 nodes
In the scenario above, when 1 node was down we would always get 2/3 of the documents returned. This is not our normal use case, but we wanted something as straightforward as possible to test out the getFromReplica option.
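As a rough check on that 2/3 figure, the sketch below estimates what share of 1k keys would have their active copy on each of 3 nodes. It is an approximation under stated assumptions: the CRC32-based key hashing mirrors how the client is generally understood to map keys to 1024 vBuckets, while the vBucket-to-node assignment by modulo and the "doc-N" key names are purely hypothetical (the real map is assigned by the cluster).

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ActiveShareSketch {

    static final int NUM_VBUCKETS = 1024;

    // Approximate key-to-vBucket mapping (assumption, see lead-in).
    static int vbucketFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) ((crc.getValue() >> 16) & 0x7fff) % NUM_VBUCKETS;
    }

    // Counts how many keys land on each node as their active copy,
    // using a hypothetical "vBucket modulo node count" layout.
    static int[] activeCountsPerNode(int numDocs, int numNodes) {
        int[] counts = new int[numNodes];
        for (int i = 0; i < numDocs; i++) {
            counts[vbucketFor("doc-" + i) % numNodes]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = activeCountsPerNode(1000, 3);
        for (int node = 0; node < counts.length; node++) {
            System.out.printf("node %d: %d active documents%n", node + 1, counts[node]);
        }
        // Documents whose active copy is NOT on node 1.
        System.out.printf("reachable without node 1 via active copies only: %d%n",
                1000 - counts[0]);
    }
}
```

If each node holds roughly a third of the active copies, then getting back about 2/3 of the documents is what reads from active copies alone would produce, which suggests the replica reads in the test are not contributing any results.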