Hi, sorry in advance for the long post, but this topic is a bit complex.
My company uses Couchbase as a data store for logic that processes signaling transactions from 3G and 4G mobile data networks. The functionality we deliver together is roughly equivalent for all our customers at a functional level; however, the network topology in place at larger carriers is significantly different from that at regional ones.
Larger-volume operators typically require Couchbase performance in the hundreds of thousands of Couchbase transactions per second. We deploy together into unique network topologies with customer-specific, fault-tolerant, highly available, geo-redundant architectures. Our technical teams work together to ensure each combined solution can be supported in a live tier-one production mobile network.
Compared to the national carriers, regional operators have very straightforward network topologies, so they benefit greatly from more streamlined solution architectures. Regional operators also make up a respectable chunk of our revenue stream, so the entire ecosystem benefits if we provide these customers with a solid solution and a reasonable path for growing it. A simplified architecture also has a positive impact on our margin because it is easier for us to sell and support the same solution repeatedly in this market.
A “simplified architecture” consists of a dual-site, non-clustered, single-node, XDCR-based synchronization approach. After reading the Couchbase blogs, I understand this is not the favored “three node minimum” approach, but we don’t actually require clustering at all, just replication. For this reason, I’m hopeful this reduced approach will make sense. We would like to move forward with several pending architecture reviews, but are unable to do so until we have engaged with you on this.
For regional operators, any reduction in complexity and hardware footprint is a big win, especially in light of their reduced performance and unique redundancy requirements. Specifically:
- Processing Requirements Are Relatively Low: in the hundreds or low thousands of transactions per second. For this discussion we can assume peak rates of 4K TPS or less for any single Couchbase node.
- A Reduced Hardware Footprint Is Desirable: Our solution is typically deployed in two separate data centers. This means that a three-node cluster in each data center (with redundancy) requires six nodes (as per the “How Many Nodes?” series of lectures). This architecture leaves the solution considerably lopsided, especially once the customer understands that a single combined-function node (running on a single laptop) handles more than twice their load. This raises the obvious question: “Why am I putting in eight boxes when I only really need two to be redundant and one would handle all the load from the network?”
- Failover and geo-redundancy complexity are drastically simplified by leveraging the network gateway’s failure mechanism (the primary network interface fails over to a secondary). The result for the application is that the buckets we use to manage usage in Couchbase are no longer directed at sessions on the primary node. On failure, those sessions stick with the secondary node until the problem can be addressed (failover is driven by the network, not the application). So these network topologies are like “The HA Highlander”: one failure mechanism to rule them all. Deploying on a single Couchbase node and using XDCR to replicate avoids having to support many levels of redundant failover triggers, sharing of IP addresses, and complicated recovery procedures. In the case of mobile data charging, if the gateway fails our application has no purpose anyhow, so there is no point in being able to operate in such an environment.
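To make the "sessions stick with the secondary" behavior concrete, here is a minimal sketch of the routing rule as I understand it. The class and function names are hypothetical and only model the idea; in the real deployment the network gateway makes this decision, not application code.

```python
from dataclasses import dataclass, field

PRIMARY = "primary"
SECONDARY = "secondary"


@dataclass
class StickyFailover:
    """Hypothetical model of which site the gateway sends sessions to."""
    active: str = PRIMARY
    history: list = field(default_factory=list)

    def on_primary_health(self, healthy: bool) -> str:
        # Fail over once when the primary is seen as unhealthy.
        if not healthy and self.active == PRIMARY:
            self.active = SECONDARY
            self.history.append("failover")
        # A primary that comes back on its own is deliberately ignored,
        # so a flapping primary cannot bounce sessions between sites.
        return self.active

    def operator_recover(self) -> str:
        # Only an explicit recovery step returns traffic to the primary.
        if self.active == SECONDARY:
            self.active = PRIMARY
            self.history.append("recover")
        return self.active


def route_events(health_samples):
    """Replay a sequence of primary health checks and return the routing."""
    fo = StickyFailover()
    return [fo.on_primary_health(h) for h in health_samples], fo


if __name__ == "__main__":
    # Primary goes down, comes back, and flaps again: routing stays put.
    routes, state = route_events([True, False, True, False, True])
    print(routes)
    print(state.history)
```

The point of the sketch is that a transient primary produces exactly one failover event rather than a series of switches back and forth.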
In the end, the ideal Couchbase solution we are advocating for these customers consists of a dual-node configuration (one Primary and one Secondary) with bidirectional XDCR to keep things in sync. A network failover from Primary Control to Secondary Control “stays failed” until recovery operations can be completed. This means the Secondary Control Platform is used only in the event of a failure and is only needed long enough for the customer to troubleshoot and recover the Primary Platform. This model reduces contention issues on the database and duplicate usage reporting should the Primary Control Server (my company’s application) or the network gateway (the customer’s box) exhibit “transient” availability. It also reduces the solution footprint considerably, from eight nodes down to two, yet still provides a roadmap for growth by adding nodes if and when required at either the Couchbase layer or the application layer.
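For reference, this is roughly what the bidirectional XDCR setup between the two single-node sites would involve, sketched as the REST calls to be issued against each node. The host names, the bucket name `usage`, the cluster reference names, and the credentials are all placeholders I made up; the endpoints and form fields are the ones I understand the Couchbase REST API to use for XDCR, but they should be checked against the documentation for the server version in use. The sketch only builds the requests and does not send anything.

```python
from urllib.parse import urlencode


def xdcr_requests(local_host, remote_host, remote_name, bucket, user, password):
    """Return (url, form_body) pairs for ONE direction of replication."""
    base = f"http://{local_host}:8091"
    # Step 1: register the other site as a remote cluster reference.
    remote_ref = (
        f"{base}/pools/default/remoteClusters",
        urlencode({
            "name": remote_name,
            "hostname": f"{remote_host}:8091",
            "username": user,
            "password": password,
        }),
    )
    # Step 2: start a continuous replication of the bucket to that reference.
    replication = (
        f"{base}/controller/createReplication",
        urlencode({
            "fromBucket": bucket,
            "toCluster": remote_name,
            "toBucket": bucket,
            "replicationType": "continuous",
        }),
    )
    return [remote_ref, replication]


def bidirectional_plan(primary, secondary, bucket, user, password):
    """Bidirectional XDCR is two one-way replications, one from each site."""
    return (
        xdcr_requests(primary, secondary, "secondary-site", bucket, user, password)
        + xdcr_requests(secondary, primary, "primary-site", bucket, user, password)
    )


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    plan = bidirectional_plan(
        "cb-primary.example", "cb-secondary.example",
        "usage", "Administrator", "change-me",
    )
    for url, body in plan:
        print("POST", url, body)
```

So the whole replication layer for a two-node deployment comes down to four calls, two per site, which is part of why this model is attractive to support.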
Can anyone recommend the best way to propose such a solution model so that it would be supported in production by the Couchbase organization?
Thanks in advance,
Bryan