Egor Kovalchuk is a deployment manager at MegaFon, one of the largest telecommunication companies in Russia.  His career in telecom spans more than a decade.  Egor’s team is in charge of developing, integrating, and monitoring multiple business systems and applications spanning the eleven time zones of Russia.

Couchbase in Telecom: MegaFon, Russia

Digital transformation is a global trend for companies large and small.  It is vital for enterprises to adapt to modern customer needs.  Customers are used to the highly available, real-time systems provided by industry leaders (Google, Amazon, Netflix), and they demand the same experience from all market players.

Russian telecommunication companies are fully affected by this trend.  New customer-friendly features need to be rolled out fast and scale easily.  We need to react quickly to the moves of our competitors.  All this needs to be delivered while managing rising IT costs (infrastructure, data centers, skilled personnel).  That’s where new technologies, such as in-memory caches and NoSQL databases, come in handy.

I will describe the following two use cases for in-memory databases at MegaFon:

  • Simple caching: the cache gets populated and updated on a schedule, as well as by database and application events.
  • Write-through caching: the changes in cache propagate to the main database (e.g., Oracle database gets updates from Couchbase DCP stream).
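The two patterns can be sketched in a few lines of Python.  This is a minimal illustration only: plain dictionaries stand in for Couchbase and for the main database, a queue stands in for a change stream such as DCP, and the names `SimpleCache`, `WriteThroughCache`, `refresh`, `on_event`, and `flush` are hypothetical rather than part of any Couchbase or Oracle API.

```python
from collections import deque


class SimpleCache:
    """Pattern 1: the cache is filled from the main database, on a schedule or on events."""

    def __init__(self, main_db):
        self.main_db = main_db      # stand-in for the system of record
        self.data = {}

    def refresh(self):
        # Scheduled job: reload every entry from the main database.
        self.data = dict(self.main_db)

    def on_event(self, key):
        # Database or application event: reload (or drop) a single changed entry.
        if key in self.main_db:
            self.data[key] = self.main_db[key]
        else:
            self.data.pop(key, None)


class WriteThroughCache:
    """Pattern 2: writes land in the cache, and a change stream carries them to the main database."""

    def __init__(self, main_db):
        self.main_db = main_db
        self.data = {}
        self.changes = deque()      # stand-in for a change stream such as DCP

    def set(self, key, value):
        self.data[key] = value
        self.changes.append((key, value))

    def flush(self):
        # Consumer side: apply queued changes to the main database in order.
        applied = 0
        while self.changes:
            key, value = self.changes.popleft()
            self.main_db[key] = value
            applied += 1
        return applied
```

In the first pattern the main database remains the source of every change; in the second, the cache accepts the write and the main database catches up when the change stream is consumed.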

The first approach is used in our decision-making system for the subscriber’s life cycle.  A single application analyzes multiple factors, makes a decision, and sends the change out to multiple systems (including an Oracle database).  An example of such an application is the locking and unlocking of prepaid plan accounts.  When the prepaid amount is used up, the account gets locked, and no service is provided until the customer refills it.  Once the account is refilled, the service needs to be re-enabled as soon as possible.  Thanks to the use of Couchbase, we have shortened the service reinstatement time from 90 to 30 seconds, and there is still room for improvement.  The only update sent to the main database is the account status change (see Figure 1 below).

Figure 1: MegaFon Fast Account Status Update Process
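The lock and unlock decision described above can be sketched as follows.  This is an illustrative sketch, not MegaFon’s implementation: the rule "lock when the balance is zero or below", the field names `balance` and `status`, and the function names are all assumptions.  It shows the one property the text does state: only a status change is forwarded to the main database.

```python
def decide_status(balance):
    """Return the service status for a prepaid balance (assumed rule: lock at zero or below)."""
    return "active" if balance > 0 else "locked"


def apply_balance_event(cache, account_id, new_balance):
    """Update the cached account and return a status change to forward, or None.

    cache is a dict standing in for the in-memory store.
    """
    account = cache.setdefault(account_id, {"balance": 0, "status": "locked"})
    account["balance"] = new_balance
    new_status = decide_status(new_balance)
    if new_status == account["status"]:
        return None                      # balance changed, status did not: nothing to forward
    account["status"] = new_status
    return {"account_id": account_id, "status": new_status}
```

A refill that takes the balance from zero to a positive amount returns a status change to `active`; further top-ups return `None`, so the main database sees no traffic for them.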

Why did we choose Couchbase Server as part of our digital transformation efforts?  Let’s look at our performance requirements and see how well Couchbase fits them.

NoSQL Database Performance Requirements

  • Processing throughput: up to 200,000 requests per second.
  • Average latency (50th percentile), single cluster: within 5 ms.
  • Maximum latency (99th percentile), single cluster: within 15 ms.
  • Maximum insert throughput: 500 MB per second.
  • Maximum number of insert operations: 100,000 per second.
  • Maximum update throughput: 500 MB per second.
  • Maximum number of update operations: 100,000 per second.
  • Maximum read throughput: 500 MB per second.
  • Maximum number of read operations: 100,000 per second.
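To make these targets concrete, here is a small sketch of how measured latencies could be checked against the 50% and 99% limits, and of the average payload size implied by the throughput figures.  The function names are hypothetical, the nearest-rank percentile method is one common choice rather than the method used in our tests, and 1 MB is assumed to mean 1,000,000 bytes.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_targets(samples_ms, p50_limit_ms=5.0, p99_limit_ms=15.0):
    """Check measured latencies against the 50% and 99% targets listed above."""
    return (percentile(samples_ms, 50) <= p50_limit_ms
            and percentile(samples_ms, 99) <= p99_limit_ms)


def implied_payload_bytes(mb_per_second=500, ops_per_second=100_000):
    """Average payload size implied by the throughput and operation-count limits."""
    return mb_per_second * 1_000_000 / ops_per_second
```

With the listed limits, 500 MB per second at 100,000 operations per second corresponds to an average payload of about 5 KB per operation.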

High Performance Key-Value Data Access

Couchbase Server, at its core, is a distributed Key-Value (KV) database.  KV storage is a simple data management approach that stores a unique identifier (key) along with a piece of arbitrary information (value).  The value can be a binary object (BLOB) or a JSON document.  Due to the simplicity of the KV implementation (especially when compared to relational databases), data access is provided with minimal latency.  In our deployments, the network latency is often 2-3 times higher than the time to complete a KV operation in a Couchbase cluster.

Flexible Data Format (JSON)

JavaScript Object Notation (JSON) is the preferred data storage format in Couchbase.  The format supports both primitive data types (booleans, numbers, strings) and composite data types (arrays and objects, also known as lists and dictionaries).

The data schema of your JSON documents can be easily changed as your application requirements evolve.  Different schema versions can be tracked by an additional document field, thus ensuring smooth application upgrades and backward compatibility.
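As a minimal sketch of version tracking through a document field, the example below upgrades an older document on read.  The field name `schema_version`, the subscriber fields, and the change between versions (a single `phone` string becoming a `phones` list) are hypothetical and chosen only for illustration.

```python
import json

CURRENT_SCHEMA_VERSION = 2


def upgrade_document(doc):
    """Bring a subscriber document up to the current schema version.

    Hypothetical change between versions: v1 stored a single "phone" string,
    v2 stores a "phones" list.
    """
    doc = dict(doc)                               # leave the caller's copy untouched
    version = doc.get("schema_version", 1)        # documents without the field are treated as v1
    if version == 1:
        phone = doc.pop("phone", None)
        doc["phones"] = [phone] if phone else []
        version = 2
    doc["schema_version"] = version
    return doc


old = json.loads('{"name": "Ivan", "phone": "+7 900 000-00-00"}')
new = upgrade_document(old)
```

Because the application checks the version field before reading a document, old and new documents can coexist in the same bucket while an upgrade is rolled out.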

High Availability

Couchbase Server provides several features to support high availability (HA) of the data.  Intra-cluster data replication (distribution of several copies of data on different servers within a single cluster) keeps data 100% available during scheduled server maintenance or unexpected server failures.

Figure 2: Couchbase Intra-cluster Replication

The Database Change Protocol (DCP) is another important HA component of Couchbase.  This high-performance streaming protocol communicates data changes to internal and external consumers.  It is responsible for maintaining global secondary indexes (GSI) used for SQL querying, full-text search (FTS) indexes, cross data center replication (XDCR, inter-cluster replication protocol), and other services.

Bidirectional (Inter-Cluster) Replication

Redundant applications and equipment have long been a best practice in the industry.  Ideally, your distributed databases are deployed in an Active-Active (AA) architecture, in which applications automatically switch to a different cluster in case of connectivity issues.  Couchbase Server supports bidirectional XDCR to enable AA scenarios.  However, the eventual consistency of data in multi-cluster deployments may not be acceptable for certain business applications.

In our environment, we found that when data centers are located more than 100 km apart, data conflict resolution becomes a challenge.  Couchbase provides two conflict resolution mechanisms:  revision-based and timestamp-based.  Due to our network latency, neither of them could provide acceptable data consistency in a full AA scenario (when writes and reads can happen on any cluster).  As a result, we implemented an architecture in which all changes (writes) are made on a single cluster that propagates changes to other data centers.  The applications can read data from either data center.
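The difference between the two mechanisms can be shown with a simplified sketch.  It is not Couchbase’s actual algorithm: the `Version` type, its fields, and the tie-breaking rules are assumptions made for illustration.  The point is that the two strategies can pick different winners when both clusters update the same document before replication catches up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    revision: int       # number of updates applied to the document
    timestamp_ms: int   # time of the last update, as seen by the writing cluster


def resolve_by_revision(a, b):
    """Keep the copy that has been updated more often; break ties by timestamp."""
    return max(a, b, key=lambda v: (v.revision, v.timestamp_ms))


def resolve_by_timestamp(a, b):
    """Keep the most recently written copy ("last write wins"); break ties by revision."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.revision))


# Cluster A updated the document three times; cluster B wrote once more, but later.
copy_a = Version("locked", revision=3, timestamp_ms=1000)
copy_b = Version("active", revision=2, timestamp_ms=1500)
```

Here `resolve_by_revision` keeps the copy from cluster A and `resolve_by_timestamp` keeps the copy from cluster B.  Routing all writes through a single cluster, as we do, avoids the question entirely, because concurrent conflicting versions never arise.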

Horizontal Scaling

Horizontal scaling (increasing cluster resources by adding new servers) is a big selling point of NoSQL databases.  An important feature of Couchbase Server is the ability to scale different cluster loads (KV operations, SQL queries, data indexing, etc.) independently; Couchbase calls this “multi-dimensional scaling” (MDS) (see Figure 3 below).  Every node in a Couchbase cluster can run a single service or multiple services; the choice is made when a node is added to the existing cluster.

Figure 3: Couchbase Multi-dimensional Scaling (MDS)
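The idea of per-node service assignment can be modeled in a few lines.  This is a toy model rather than a Couchbase API: the cluster is a dictionary from node name to a set of services, and the node names and the service labels `data`, `query`, and `index` are illustrative.

```python
from collections import Counter


def nodes_per_service(cluster):
    """Count how many nodes run each service."""
    counts = Counter()
    for services in cluster.values():
        counts.update(services)
    return dict(counts)


def add_node(cluster, name, services):
    """Return a new cluster layout with one more node; services are fixed when the node joins."""
    if name in cluster:
        raise ValueError(f"node {name!r} is already in the cluster")
    updated = dict(cluster)
    updated[name] = frozenset(services)
    return updated


cluster = {
    "node1": frozenset({"data"}),
    "node2": frozenset({"data"}),
    "node3": frozenset({"query", "index"}),
}
# Index load grew: add a node that runs only the index service.
scaled = add_node(cluster, "node4", {"index"})
```

After the call, the index service runs on two nodes while the data and query tiers are unchanged, which is the essence of scaling one dimension at a time.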

Information Security Requirements

Couchbase security features were another reason (albeit not the primary one) for making it our system of choice.  Since it is possible for personally identifiable information (PII) to be stored in cache, our company has to be in compliance with the governing laws.  If the data platform does not offer the necessary security features, it may be necessary to purchase additional hardware to be compliant.

Couchbase Server Enterprise Edition (EE) supports traffic encryption, data encryption, and role-based access control (RBAC).  This potentially allows you to save on network security hardware, such as Cisco ASA.

Ease of Upgrade

Couchbase Server provides several different upgrade options.  The online upgrade option allows your applications to continue working with the cluster with minimal impact on performance and functionality (due to API backward compatibility).  While cluster nodes are being upgraded, the cluster will continue working in compatibility mode; the features of the new version will become active once all the nodes complete the upgrade.

Additional Functionality

Server Group Awareness (Rack Awareness, Availability Zones)

In Couchbase, individual servers can be assigned to specific server groups.  This feature is similar to cloud service availability zones (AZ).  When server groups are used, the distribution algorithm for active data and replicas ensures full data availability in case an entire server group goes offline.

In the telco world, this allows us to keep fully replicated datasets in different equipment rooms of the data center.  If the entire equipment room (mapped to a Couchbase server group) goes offline, the applications will continue working with the full dataset in the other equipment room.

Figure 4: Couchbase Server Groups
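The placement rule behind server groups can be sketched as follows.  This is a simplified model with one replica per shard and a round-robin assignment, not Couchbase’s actual distribution algorithm; the function names and the node and room names are hypothetical.

```python
def place_with_groups(node_groups, num_shards):
    """Assign each shard an active node and one replica node in a different server group.

    node_groups maps node name -> server group name.
    """
    nodes = sorted(node_groups)
    placement = {}
    for shard in range(num_shards):
        active = nodes[shard % len(nodes)]
        candidates = [n for n in nodes if node_groups[n] != node_groups[active]]
        if not candidates:
            raise ValueError("need at least two server groups")
        replica = candidates[shard % len(candidates)]
        placement[shard] = (active, replica)
    return placement


def survives_group_loss(placement, node_groups, lost_group):
    """True if every shard still has a copy outside the lost group."""
    return all(
        node_groups[active] != lost_group or node_groups[replica] != lost_group
        for active, replica in placement.values()
    )


groups = {"n1": "room1", "n2": "room1", "n3": "room2", "n4": "room2"}
placement = place_with_groups(groups, 16)
```

Because the active copy and its replica never share a group, losing either equipment room still leaves one complete copy of every shard.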

Backup and Restore

Couchbase provides several backup and recovery tools; cbbackupmgr is available in EE only.  The backup can be done in three different ways:  full, differential, and cumulative.  The correct combination of these backup modes allows you to save disk space and optimize system resource usage.

Figure 5: Couchbase Backup Combining
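To see why the combination of modes matters, the sketch below computes which backups are needed for a restore.  It assumes the usual meaning of the terms (a differential backup holds changes since the previous backup of any kind, a cumulative backup holds all changes since the last full backup) and does not call any Couchbase tool; `restore_chain` is a hypothetical helper.

```python
def restore_chain(backups):
    """Return the backups needed to restore the state as of the last entry.

    backups is a chronological list of modes: "full", "differential" or
    "cumulative". The result is a list of (index, mode) pairs, oldest first.
    """
    chain = []
    i = len(backups) - 1
    while i >= 0:
        mode = backups[i]
        chain.append((i, mode))
        if mode == "full":
            return list(reversed(chain))
        if mode == "cumulative":
            # A cumulative backup covers everything since the last full one,
            # so skip straight back to that full backup.
            i -= 1
            while i >= 0 and backups[i] != "full":
                i -= 1
        else:
            # A differential backup depends on the backup right before it.
            i -= 1
    raise ValueError("no full backup to start the restore from")


week = ["full", "differential", "differential", "cumulative",
        "differential", "differential"]
chain = restore_chain(week)
```

For this schedule the restore needs four backups (the full one, the cumulative one, and the two differentials after it) instead of all six, at the cost of a larger cumulative backup in the middle of the week.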

Couchbase vs. MongoDB

Choosing a NoSQL database from the available competing technologies may be challenging.  [At least with Linux OS it’s easier:  the best Linux distro is the one your system administrator knows best.]  We summarized in the table below some of the important differences that convinced us to choose Couchbase technology over another popular NoSQL platform, MongoDB.

It is admittedly difficult to compare two different projects with different architectures and functionality.  It was important for us to have a system that is easy to maintain and can be quickly adjusted to changing business needs.

Feature | Couchbase | MongoDB
--- | --- | ---
Sharding | Automatic for entire dataset. | Manual key selection per collection.
Data Distribution | Data always uniformly distributed over all data nodes. | Range sharding can result in non-uniform distribution.
Adding/Removing Nodes or Shards | Simple, completed in one step (over GUI, REST API, or CLI), followed by a rebalance. | Complex, need to create replica sets.  Each collection scales differently.
Rack Awareness | Built-in and easy, via server groups. | Not built-in, need to manually allocate replica set nodes from different racks.
Balanced Setup | Cluster is always balanced, with each node having an equal number of active vBuckets (shards). | Not balanced.  Secondary nodes do not serve any write traffic (not even read traffic by default).
Index Scale-Out | Indexes scale independently of data.  Can even use a different kind of hardware for index nodes. | Index scaling is coupled with data scaling.  Have to add capacity to the data cluster to accommodate changes in query workload.
Cluster Metadata | No special nodes needed, distributed across all data nodes. | Special configuration servers need to be set up, a minimum of 3 nodes.
Replication Architecture | Completely independent cluster, which can be scaled and managed without any dependencies. | Extension of intra-cluster replication, not an independent system.
Replication Flexibility | Very flexible; bucket level, advanced optimization techniques to tune to the need. | Tuning, choosing speed, bandwidth is not possible.
Replication Topology | Support for complex topologies:  bidirectional, star, mesh, chain, ring, etc. | No support for complex topologies; unidirectional and star only.  Primary is a bottleneck.
Active-Active Replication | Supported. | Not supported.

Overall, Couchbase is more flexible and easier to maintain and configure for MegaFon’s use cases and the constantly evolving hybrid architecture.

Our Couchbase Journey So Far

Below are some quick stats from our production Couchbase cluster and its load:

  • The cluster handles data of more than 80 million subscribers. This number includes mobile phones, LTE routers, multiple consumer devices with built-in SIM cards, etc.
  • 380 million JSON documents with customer data
  • 3.5 TB disk storage (active dataset, replicas not included)
  • 3 TB RAM
  • 50,000 operations per second, sustained (see Figure 6 below)
  • 50 microservices processing the entire message flow

Figure 6: Couchbase Production Load

This transformation project started with Couchbase Server version 3.x.  Initially, all applications worked stably.  However, when we added new application functionality that relied on Couchbase views, we started experiencing unpredictable view behavior.  [Views are a map/reduce indexing and querying mechanism.  They are considered a legacy feature in Couchbase Server 5.x-6.x and are being phased out in favor of global secondary indexes and N1QL querying.]  The view update process on a node would sometimes freeze, which prevented applications from receiving the view data from this node.  The KV operations would continue performing normally.

This issue could be fixed by restarting the node, which affected data availability.  As a temporary workaround (while we were planning to upgrade to Couchbase Server version 4.x), Couchbase technical support suggested a version-specific undocumented command that restarts only the view update process.

Another issue we encountered with Couchbase Server version 3.x was periodic termination of the compaction process.  The process had to be restarted manually upon receiving monitoring alarms.  Both of these production issues were a headache for operations and development staff alike.

Following the recommendation of Couchbase technical support, we decided to upgrade to Couchbase Server version 4.x.  The overall upgrade process took about two weeks, since we had to ensure minimal impact on the production applications.  The upgrade steps are pretty straightforward, but the rolling online upgrade (node removal, rebalance, node upgrade, node addition to the cluster, and another rebalance) would take 2+ hours per node.  We were able to optimize this process by introducing an extra node to take advantage of Couchbase swap rebalance.  In this case, the data is copied directly from the node being removed to the node being added, which greatly speeds up the rebalance.  This reduced the upgrade time per node to 30 minutes.

When upgrading a production Couchbase cluster, you should keep in mind that the cluster will operate in compatibility mode, in which only the features of the older version are available across all nodes.  This ensures a smooth and painless upgrade.  The downside is that the new features and fixes of the upgraded version (N1QL indexes and queries, full-text search, etc.) will only become available when all nodes in the cluster are upgraded.
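Compatibility mode can be thought of as the cluster behaving like its oldest node.  The sketch below models that rule; it is an illustration of the concept, not how Couchbase computes it internally, and the function names and version strings are hypothetical examples.

```python
def parse_version(text):
    """Turn a version string such as "4.6.4" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def effective_version(node_versions):
    """During a rolling upgrade the cluster behaves like its oldest node."""
    return min(node_versions, key=parse_version)


def feature_available(node_versions, introduced_in):
    """A feature becomes usable only once every node runs a version that has it."""
    return parse_version(effective_version(node_versions)) >= parse_version(introduced_in)


mid_upgrade = ["4.6.4", "4.6.4", "3.1.0"]     # one node still on the old version
after_upgrade = ["4.6.4", "4.6.4", "4.6.4"]
```

With one node still on the older release, a feature introduced in 4.0.0 is reported as unavailable; it becomes available as soon as the last node is upgraded.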

Our initial upgrade to version 4.x fixed only the issue with compaction.  The view issue remained, although it arose not nearly as often.  It was fully remedied only in Couchbase Server version 4.6.4.

We were also notified by Couchbase technical support that the views functionality will no longer be improved and is on its way to being deprecated.  Global Secondary Indexes (GSI) and N1QL queries (N1QL, pronounced “nickel”, is the Couchbase SQL implementation) are a much more scalable alternative to views.  Index and query loads can be scaled independently, without being tied to data nodes (see Figure 7 below):

Figure 7: Couchbase Services on Different Nodes

With the upgrade to Couchbase Server 4.6.4, we resolved all our critical production issues.  However, the new features and improvements of Couchbase Server 5.1 compelled us to complete another upgrade cycle.  With the new GSI engine, our indexes now take up about 1.5 times less disk and memory space, which helps with our increasing data volume.  Unfortunately, version 5.1 does not offer any data storage improvements (in-memory or on-disk).  Data compression was added in version 5.5.

Conclusion

Overall, Couchbase Server proved to be a mature, high-performance data platform.  As part of the MegaFon hybrid architecture, a Couchbase cluster can be easily adapted for any production load without equipment downtime or extensive configuration changes to the servers.  This generally results in lower personnel costs and higher customer satisfaction.

 

The original article has been published in a popular Russian collaborative IT blog, Habrahabr: https://habr.com/ru/post/436762/

Author

Posted by Oleg Kuzmin, Sr. Solutions Engineer, Couchbase
