24x7 Uptime during a fail-over

We recently submitted an issue to Support about our application going offline during a graceful fail-over. Here’s the scenario:

Our entire application is built off a single data element called an “opportunity”. A user takes an action on an opportunity, and the document for that client/opportunity is updated. So far, so good during a graceful fail-over.

Here’s where the fail-over breaks:

After the user submits the opportunity change, it should naturally drop off their work queue since the item has been “worked”. Here’s the problem: The “queue” is populated by a view, and the view has to be queried with stale=false so that the opportunity properly drops off the user’s queue once they’ve completed their work.

The problem is that any view that’s called with stale=false will NOT return results during the graceful fail-over. stale=update_after isn’t a good option, as the user will continue to see the opportunity on their queue until the graceful fail-over is complete (which could take 10-15 minutes!).

This is a SUPER common scenario, no different than someone adding/removing things from a shopping cart in your average online shopping site.

Is there a better way we should be doing this? The engineer behind our support ticket said that N1QL has the same issue, but we didn’t dive real deep into it. Is there an architecture or software change we can make such that the system is actually 24x7 during a graceful fail-over?

If our application goes down for 10-15 minutes during a graceful fail-over, that’s not really a 24x7 system.

It’s interesting that in this post the submitter is having a similar issue. Is the concept of a view request with stale=false not a widely known issue?

One note on the comment “…system going down for 10-15 mins…” While we could change our code to detect a fail-over and change the view call to stale=ok, that’s still not really a 24x7 system (for us). If the changes our users make aren’t reflected in the UI, then to them it looks “down” or not functioning properly.
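For anyone who does want the fallback described above, here is a minimal sketch of it. It assumes a hypothetical `run_view(stale)` callable that wraps whatever SDK or REST call your application already makes and raises an exception when the stale=false query fails; neither that name nor the returned "fresh" flag comes from Couchbase itself.

```python
def query_queue(run_view, fallback_errors=(Exception,)):
    """Try a consistent view query first, then fall back to a stale one.

    run_view is a hypothetical callable supplied by the application:
    it takes the stale setting as a string and returns the view rows,
    raising an exception if the query cannot be served.

    Returns (rows, is_fresh). is_fresh is False when the rows came from
    the stale=ok fallback and may still include already-worked items.
    """
    try:
        return run_view("false"), True
    except fallback_errors:
        # The stale=false query failed (for example during a fail-over),
        # so serve possibly outdated results rather than nothing at all.
        return run_view("ok"), False
```

The flag lets the UI at least tell the user that the queue may be out of date, which does not solve the freshness problem but avoids presenting stale data as current.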

Note that we considered view queries against down nodes in the design. View queries can include a directive on what to do when an error is encountered. At the REST interface, the param is on_error and it takes continue or stop as arguments.

That’s available in all of the SDKs in whatever is idiomatic to that platform.
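As a rough illustration of the REST form, the sketch below builds a view query URL carrying both the stale and on_error params. The host, bucket, design document, and view names are placeholders, and the helper `build_view_url` is made up for this example; it assumes the view API is reached on its default port, 8092.

```python
from urllib.parse import urlencode


def build_view_url(host, bucket, design_doc, view,
                   stale="false", on_error="continue"):
    """Build a view query URL (illustrative helper, not an SDK function)."""
    if stale not in ("false", "ok", "update_after"):
        raise ValueError("stale must be false, ok, or update_after")
    if on_error not in ("continue", "stop"):
        raise ValueError("on_error must be continue or stop")
    # urlencode keeps the dict's insertion order: stale first, then on_error.
    query = urlencode({"stale": stale, "on_error": on_error})
    # 8092 is assumed here as the default view API port.
    return f"http://{host}:8092/{bucket}/_design/{design_doc}/_view/{view}?{query}"


if __name__ == "__main__":
    print(build_view_url("cb.example.com", "opportunities", "queues", "by_user"))
```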

Note this is just during the period of time that a node is failing or has failed. Once a failover has completed, which can be 30s to minutes with autofailover in 4.x, everything goes back to normal.

One of the cool new features in 5.0 is more advanced fault detection which allows for faster failovers. 5.0 beta 2 has fast failover on the order of about 10s.

I should mention here that the proper way to remove a node is to click “remove” and then “rebalance”. It’ll take longer, but then your stale=false view queries won’t be impacted by the node being removed. Graceful failover is perhaps misnamed, since it’s only graceful at the replication level. It’s not graceful for any of the other services, or for applications.