Automatic failover in an environment where any server could die at any time

So this is a bit of a tricky situation to manage, especially automatically - one DB that does seem to handle this very well is Cassandra.

Essentially, I’m working on a project where we’re evaluating Couchbase, and we’re hoping to host it on Google Cloud Platform on preemptible nodes that could be turned off at any time. We’ll probably start with 3 nodes, which fulfils Couchbase’s initial requirements, but we’re wondering a few things:

  1. If a node is turned off and a new node comes up in its place (i.e. not the same node turning back on), how does Couchbase handle that? Would we end up with a 4-node cluster with 1 node permanently offline?
  2. Can a new node heal from having no data, and rebalance automatically when it joins the cluster?
  3. If you have more than 3 nodes, can more than one fail at once? (The research we’ve done so far suggests that if one node fails, anything beyond that requires manual intervention - which is not ideal, even though the reasoning is understandable.)
  4. Can you automatically rebalance?

Perhaps a different approach could be suggested in place of the above? Either way, Couchbase looks awesome and seems to fulfil a lot of our needs, but this is a fairly critical part of what we need. We’re also wondering whether we could just write our own thin supporting software to handle some of the above.

Hi @SeerUK

I’m not from Couchbase itself, but I can answer some of your questions:

  1. When you’ve activated automatic failover, the node that goes down will be hard failed over (see the first sketch below for turning auto-failover on via the REST API). It stays in the cluster until you rebalance, but it doesn’t serve any requests and no longer holds active data. If you then hit Remove on that node, add a new server to the cluster and start the rebalance, you should get a swap rebalance (I haven’t tested whether that still applies when the outgoing node is in a failed-over state). A swap rebalance is very fast because the data only has to be copied onto the one incoming node.

  2. You can add as many nodes to the cluster as you want (there will be a limit somewhere), but you need to start the rebalance yourself, e.g. with a script that gets triggered when a new node starts up (see the second sketch below).

  3. Nope. With only 1 replica per vBucket (shard), you need to rebalance first before you can fail over another node without data loss.

  4. I think it is possible to start the rebalance over the REST API - the second sketch below does exactly that.
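
To point 1: here is a minimal sketch of enabling auto-failover through the cluster REST API instead of the UI. I’m using Python with `requests`; the hostname and credentials are placeholders, and you should double-check the endpoints against the REST API docs for your Couchbase version:

```python
# Minimal sketch: enable auto-failover via the Couchbase cluster REST API.
# "cb-node-1" and the credentials are placeholders for your own cluster.
import requests

CLUSTER = "http://cb-node-1:8091"      # any node in the cluster
AUTH = ("Administrator", "password")   # cluster admin credentials

# Enable auto-failover with a 120-second detection timeout.
resp = requests.post(
    f"{CLUSTER}/settings/autoFailover",
    auth=AUTH,
    data={"enabled": "true", "timeout": 120},
)
resp.raise_for_status()

# Read the settings back to confirm they took effect.
print(requests.get(f"{CLUSTER}/settings/autoFailover", auth=AUTH).json())
```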

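To points 2 and 4: the second sketch is the kind of “thin supporting software” you mentioned - it joins a freshly started replacement node and kicks off a rebalance over the REST API, ejecting any failed-over node in the same pass. Same assumptions as above (placeholder hostnames/credentials, standard endpoints on port 8091), and I haven’t tested it end-to-end against a failed-over cluster, so treat it as a starting point rather than a finished tool:

```python
# Sketch: join a replacement node to the cluster and trigger a rebalance,
# e.g. from a startup script on the new instance or a small supervisor.
# Hostnames and credentials are placeholders.
import time
import requests

CLUSTER = "http://cb-node-1:8091"      # any surviving node in the cluster
AUTH = ("Administrator", "password")   # cluster admin credentials
NEW_NODE = "cb-node-4.internal"        # the replacement (preemptible) instance

# 1. Add the new node to the cluster. It must already be running Couchbase
#    and be initialised, and it must not belong to another cluster.
resp = requests.post(
    f"{CLUSTER}/controller/addNode",
    auth=AUTH,
    data={
        "hostname": NEW_NODE,
        "user": AUTH[0],
        "password": AUTH[1],
        "services": "kv",
    },
)
resp.raise_for_status()

# 2. Collect the current node list. Failed-over nodes show up with
#    clusterMembership "inactiveFailed"; listing them as ejected removes
#    the dead node during the same rebalance.
pool = requests.get(f"{CLUSTER}/pools/default", auth=AUTH).json()
known = [n["otpNode"] for n in pool["nodes"]]
ejected = [n["otpNode"] for n in pool["nodes"]
           if n.get("clusterMembership") == "inactiveFailed"]

# 3. Start the rebalance across all known nodes, ejecting the failed ones.
resp = requests.post(
    f"{CLUSTER}/controller/rebalance",
    auth=AUTH,
    data={"knownNodes": ",".join(known), "ejectedNodes": ",".join(ejected)},
)
resp.raise_for_status()

# 4. Poll until the rebalance has finished.
while True:
    progress = requests.get(
        f"{CLUSTER}/pools/default/rebalanceProgress", auth=AUTH
    ).json()
    if progress.get("status") == "none":
        break
    time.sleep(10)
print("rebalance complete")
```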
That said, I’d recommend having servers that don’t just go down because Google wants them to go down…