The goal of load testing should be to identify what load your current cluster can handle, and how you need to adapt your cluster configuration as you reach various load milestones. The result should be a plan for adapting your cluster to user growth before you actually need it. This gives you the luxury of planning ahead, keeping your applications and databases performing within the desired SLAs, with no dips in service as your load grows beyond the capacity of your current configuration.
Understand that this is not information anyone can hand you without performing a significant number of tests. There are many variables to consider, such as access patterns, data structures, application code, network, hardware, and so on. This applies not just in a self-hosted data center, but also in cloud-hosted environments. There is no substitute for performing this preemptive load testing for your specific use case.
Note: If you are hosting your Couchbase cluster in a cloud environment (AWS, Azure, Google Cloud, etc.), you can streamline the process of creating various Couchbase cluster configurations by leveraging the Couchbase Autonomous Operator for Kubernetes. This tool lets you change the cluster configuration in a Kubernetes YAML file, and the changes are then automatically deployed to the cluster. It eliminates a large amount of system administration work: installing and configuring Couchbase on nodes, adding and removing nodes from a cluster, performing rebalances, and so on.
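For illustration, here is a minimal sketch of what such a cluster definition might look like. The field names follow the v2 CouchbaseCluster CRD, but the cluster name, image tag, secret name, and server-group sizes are assumptions you would replace with your own; check the Operator documentation for your version:

```yaml
# Minimal, illustrative CouchbaseCluster resource for the Autonomous Operator.
# All names and sizes are placeholders.
apiVersion: couchbase.com/v2
kind: CouchbaseCluster
metadata:
  name: load-test-cluster
spec:
  image: couchbase/server:7.2.0
  security:
    adminSecret: cb-admin-auth   # hypothetical Secret holding admin credentials
  buckets:
    managed: true
  servers:
    - name: data
      size: 3                    # bump this and re-apply to scale the data service
      services:
        - data
    - name: query-index
      size: 2
      services:
        - query
        - index
```

Scaling a service then amounts to editing the relevant `size` value and re-applying the file with `kubectl apply`; the Operator adds or removes pods and triggers the rebalance.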
Specify Your SLAs
Before you run any performance tests, you need to have your SLAs (Service Level Agreements) articulated. “As fast as possible” is not an SLA. You need a specific SLA so that your performance testing becomes a pass-fail exercise: either your cluster is performing at or better than the level you’ve determined it needs to, or it isn’t. If you haven’t determined what that response time is, you’ll never be able to determine at what load performance has entered the unacceptable range and the cluster needs to be scaled up to perform at the level you need.
So how do you determine what your SLA needs to be? Start with your most complex process that must complete within a limited window. Often this will be a user interaction that you don’t want a user to wait more than X seconds to complete. Occasionally it’ll be a micro-service process that needs to be as close to real time as possible. In short, it has to be some process that has a defined start and end point, and a specific time window it needs to complete within.
Once you have your process, you need to identify the number of database interactions performed during it, along with the nature of each interaction. Once these have been identified, you’ll need to remove the database interactions to get some baseline process timings. This is where mock objects come into play: use them to substitute for the database interactions, immediately returning a realistic set of results without the actual database round trip. By putting your process under load without the database interactions, you can determine how much of the overall time allowed for the full process the non-database work consumes. The remaining time is the window that all the database interactions need to perform within.
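As a sketch of the idea, the snippet below times a hypothetical business process with the data layer stubbed out by a mock; the process, repository methods, and canned results are all illustrative stand-ins for your own code:

```python
import time
from statistics import mean

# Hypothetical stand-in for the real data layer: each method immediately
# returns a canned but realistic result, so the timings below measure only
# the non-database portion of the process.
class MockUserRepo:
    def get_profile(self, user_id):
        return {"id": user_id, "name": "Test User", "tier": "gold"}

    def get_recent_orders(self, user_id):
        return [{"order_id": n, "total": 42.0} for n in range(5)]

def checkout_summary(repo, user_id):
    # The business process under test; its structure is purely illustrative.
    profile = repo.get_profile(user_id)
    orders = repo.get_recent_orders(user_id)
    return {"user": profile["name"], "order_count": len(orders)}

repo = MockUserRepo()
timings = []
for i in range(10_000):
    start = time.perf_counter()
    checkout_summary(repo, i)
    timings.append(time.perf_counter() - start)

baseline_ms = mean(timings) * 1000
print(f"non-database baseline: {baseline_ms:.3f} ms per process")
# With a 200 ms full-process SLA, the database window is 200 - baseline_ms.
```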
Note: Once you know how much time is available for your database interactions, step back and determine whether your business logic leaves sufficient time for them. If the business logic alone consumes all the time allowed by the SLAs, then your SLAs might be unrealistic and need to be revisited. If your business logic doesn’t leave sufficient time for database interactions, you need to relax your SLAs, streamline your business logic, or reduce the number of database interactions performed.
Once you have the number of database interactions, and the time within which they need to be performed, it’s a simple matter of a few calculations to determine how much time each database interaction has to return results. From here, you must determine the mix of database interactions. Are they all K/V accesses? Are they all N1QL queries? A mix of both? As for your N1QL queries, are they all simple queries, or are there some complex queries mixed in? By looking at this mix you can create a scaled set of database SLAs, where your K/V accesses have one SLA, your simple N1QL queries another, and your complex queries a third. Understand that these are average target SLAs for your database interactions: not every interaction will meet its specific SLA, but the average should meet or exceed the target.
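Here is a back-of-the-envelope version of that calculation; every number, and the assumption that the interactions run sequentially, is illustrative:

```python
# All numbers are illustrative. Assumes the interactions run sequentially;
# parallel interactions would loosen the per-interaction budgets.
process_sla_ms = 200.0        # full-process SLA
non_db_baseline_ms = 40.0     # measured with the mocked data layer
db_budget_ms = process_sla_ms - non_db_baseline_ms   # 160 ms for all DB work

# Interaction mix for this process: (count per process, relative weight).
# The weights encode that a complex N1QL query is allowed far more time
# than a K/V get; tune them to what your cluster can realistically deliver.
mix = {
    "kv_get":       (8, 1.0),
    "simple_n1ql":  (2, 5.0),
    "complex_n1ql": (1, 20.0),
}

total_weight = sum(count * weight for count, weight in mix.values())
unit_ms = db_budget_ms / total_weight   # 160 / 38 ≈ 4.2 ms per weight unit

for kind, (count, weight) in mix.items():
    print(f"{kind}: target average {unit_ms * weight:.1f} ms each "
          f"({count} per process)")
```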
Once you have your target SLAs defined, you need to determine how much performance degradation is acceptable. As the load grows on your cluster, there is a point where the performance of both your application and database will start to degrade. How much of that degradation is acceptable? The answer gives you your line in the sand beyond which you can’t allow performance to degrade any further: the breaking point in the load your cluster can handle, and a load milepost signaling that your cluster needs to be reconfigured to prevent further degradation. Remember that the goal of this testing is to produce a roadmap for your cluster configuration, so you know how much load each configuration can handle and can set the milepost markers that signal it’s time to reconfigure your cluster for increased load.
Find Where The Current Configuration Breaks
Now that you have your goal SLA for the complete process, a known average time for the non-database work, and calculated SLAs for your database interactions, you need to determine whether your current Couchbase cluster configuration can meet those SLAs.
First off, under a generic load-generation tool like pillowfight or n1qlback, does your cluster meet the response-time SLAs needed, or do you need to relax your SLAs to provide a larger processing window? Assuming your SLAs are theoretically achievable, the next question is whether your current application and Couchbase cluster configuration can meet them together. Once you’ve verified that the desired SLAs are achievable by your full-stack business process, application and database cluster both, it’s time to increase the load and see how everything scales.
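If you want to instrument the same check from code rather than a prebuilt tool, a sketch using the Couchbase Python SDK (4.x-style API) might look like the following; the connection details, bucket, and document key are placeholders:

```python
import time
from statistics import quantiles

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

# Host, credentials, bucket, and key below are placeholders.
cluster = Cluster("couchbase://cb.example.com",
                  ClusterOptions(PasswordAuthenticator("tester", "password")))
collection = cluster.bucket("travel-sample").default_collection()

def p95_ms(op, n=1000):
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        op()
        samples.append((time.perf_counter() - start) * 1000)
    return quantiles(samples, n=100)[94]   # 95th percentile

kv_p95 = p95_ms(lambda: collection.get("airline_10"))
query_p95 = p95_ms(lambda: list(cluster.query(
    "SELECT t.name FROM `travel-sample` t "
    "WHERE t.type = 'airline' LIMIT 10").rows()))

print(f"K/V get p95: {kv_p95:.2f} ms, simple N1QL p95: {query_p95:.2f} ms")
```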
Once you’ve confirmed that your SLAs are achievable with your current configuration, you could immediately increase the load to your target user/process load and see how it performs, but that might not be a good idea. If you significantly increase the load suddenly only to find that you are no longer meeting your SLAs, you have gathered no useful information. You may know that your cluster will need to scale before that load level is reached, but you don’t know when you must scale. In short, you have no actionable information yet.
What you want to do is gradually add load while watching the response time, noting when the current configuration’s performance degrades beyond the target SLAs. You then want to continue increasing load until performance crosses the second, degraded SLAs, and note that point as well. This gives you two load mileposts: the load at which you need to begin scaling your cluster, and the load before which your cluster scaling must be complete.
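A step-load harness for finding those two mileposts can be as simple as the sketch below; the stand-in interaction, worker counts, and SLA numbers are all placeholders for your real workload and targets:

```python
import concurrent.futures
import time
from statistics import quantiles

TARGET_SLA_MS = 5.0     # target p95 for this interaction (illustrative)
DEGRADED_SLA_MS = 15.0  # the line-in-the-sand p95 (illustrative)

def one_interaction():
    time.sleep(0.002)   # stand-in for a real K/V get or N1QL query

def p95_under_load(workers, ops_per_worker=200):
    def worker():
        local = []
        for _ in range(ops_per_worker):
            start = time.perf_counter()
            one_interaction()
            local.append((time.perf_counter() - start) * 1000)
        return local
    samples = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for fut in [pool.submit(worker) for _ in range(workers)]:
            samples.extend(fut.result())
    return quantiles(samples, n=100)[94]   # 95th percentile

sla_crossed = False
for workers in range(10, 500, 10):         # step the load up gradually
    p95 = p95_under_load(workers)
    print(f"{workers} workers: p95 {p95:.1f} ms")
    if not sla_crossed and p95 > TARGET_SLA_MS:
        sla_crossed = True
        print(f"  -> first milepost: target SLA crossed at ~{workers} workers")
    if p95 > DEGRADED_SLA_MS:
        print(f"  -> second milepost: degraded SLA crossed at ~{workers} workers")
        break
```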
Test With Multiple Growth Configurations
Once you’ve identified the load that degrades the performance of your cluster to an unacceptable degree, the next question that you have to answer is which Couchbase service cannot handle the load. Is it K/V access that is suffering? N1QL queries? Does the index service have a large queue of outstanding updates that doesn’t seem to keep up? Are any of the nodes showing a CPU load close to 100%? These are some clues to look for to determine how to scale your Couchbase cluster.
Once you’ve added one or two nodes to the particular service that you’ve determined to be the one suffering under the load, repeat the load scale-up process until you again find the load that degrades the performance. Did adding the node make any difference? If not, then you may have identified the wrong service to scale. If it made a difference, and your cluster can handle a higher load, then you’ve got a couple more milepost markers for scaling.
Again, repeat the same analysis as before to determine which service to scale this time. Maybe it’ll be the same as before, maybe a different service. The key is that each time you add nodes to your cluster to handle increased load, you are certain you’re adding them to the service that actually needs to be scaled.
Note: If the service being scaled is the index service, remember that adding a node and running a rebalance does not make all the needed configuration changes. To move indexes from one node to another during a rebalance, the node the indexes currently reside on must be marked for removal; otherwise the indexes will not move. If you just want to add one or two more nodes to a running cluster using Kubernetes, you may need to manually move individual indexes from the existing nodes over to the new node to spread the load.
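N1QL supports moving an index between index nodes with an `ALTER INDEX` statement (available since Couchbase Server 5.5). A sketch of issuing it from the Python SDK; the index name and node address are placeholders:

```python
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

# Connection details, index name, and target node are placeholders; the
# ALTER INDEX "move" action itself is standard N1QL (Server 5.5+).
cluster = Cluster("couchbase://cb.example.com",
                  ClusterOptions(PasswordAuthenticator("tester", "password")))
result = cluster.query(
    'ALTER INDEX `travel-sample`.idx_airline_name '
    'WITH {"action": "move", "nodes": ["new-index-node.example.com:8091"]}'
)
list(result.rows())  # consume the result so any error surfaces
```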
Test Operations While Under Load
Knowing what load degrades the performance of your cluster configuration isn’t enough information to plan your scaling. You also need to know when it’s best to change your cluster configuration. Scaling your Couchbase cluster isn’t just a matter of adding a node and turning it on: the cluster requires a rebalance before the new node starts taking any load as part of the cluster. How is the rebalance going to affect your SLAs? What load can you continue to run simultaneously with a rebalance while still meeting your SLAs?
The only way to answer these questions is to take the same approach as before: start with a very light load and gradually increase it while a node-adding rebalance is running. Once you have identified an acceptable load to carry while running a rebalance, you can look at your usage patterns to identify a day and time when it is preferable to run any configuration-changing rebalances.
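To drive that test, you can trigger and monitor the rebalance programmatically through the cluster REST API while the step-load harness runs. The endpoints below (`/controller/rebalance`, `/pools/default/rebalanceProgress`) are part of the Couchbase REST API; the host, credentials, and otpNode names are placeholders:

```python
import requests

BASE = "http://cb.example.com:8091"     # placeholder host
AUTH = ("Administrator", "password")    # placeholder credentials

# Kick off a rebalance after the new node has been added; knownNodes must
# list every node's otpNode name (the "ns_1@host" values are placeholders).
resp = requests.post(
    f"{BASE}/controller/rebalance",
    auth=AUTH,
    data={
        "knownNodes": "ns_1@cb-node1.example.com,ns_1@cb-node2.example.com,"
                      "ns_1@cb-node3.example.com",
        "ejectedNodes": "",
    },
)
resp.raise_for_status()

# Poll progress while the step-load harness measures latency in parallel.
progress = requests.get(f"{BASE}/pools/default/rebalanceProgress", auth=AUTH)
print(progress.json())
```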
Identify Multiple Growth Target Configurations
Once you have your various metrics of load capacity, both with and without operational processes running, you can build your plan for anticipating and adapting to growth. Your plan should include milepost triggers identifying when to scale your cluster, how to scale it, and when to schedule the scaling.
The resulting growth plan should specify a metric that acts as a trigger for each scale-up event, and what should be done as part of that scale-up (e.g. add 1 K/V node, 2 query nodes). When these milepost triggers are reached in your production environment, schedule the appropriate changes to your cluster, implementing them at the earliest appropriate window (a low-load time). If your cluster load is seasonal, these scale-ups and scale-downs can and should be planned well ahead of time.
If the changes in load are not seasonal but result from organic growth, your milepost map should include additional markers at around 80% of the next cluster-change milepost. These 80% markers are “get ready” mileposts, alerting your team that the load on your cluster is growing, and that it should plan on scaling the cluster soon and make the necessary preparatory arrangements.
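One way to make these mileposts actionable is to encode them as data that your monitoring can check. The sketch below is purely illustrative, including the 80% “get ready” logic described above; the trigger numbers and actions would come from your own load tests:

```python
from dataclasses import dataclass

@dataclass
class Milepost:
    trigger_ops_per_sec: int   # sustained load at which to act
    action: str                # what the scale-up event changes

# All numbers and actions are illustrative; yours come from the load tests.
growth_plan = [
    Milepost(20_000, "add 1 K/V (data) node"),
    Milepost(35_000, "add 2 query nodes"),
    Milepost(50_000, "add 1 index node and redistribute indexes"),
]

def check_mileposts(current_ops_per_sec):
    # Assumes completed mileposts have been removed from the list, so only
    # the next upcoming milepost matters.
    next_milepost = growth_plan[0]
    if current_ops_per_sec >= next_milepost.trigger_ops_per_sec:
        print(f"TRIGGER: schedule '{next_milepost.action}' now")
    elif current_ops_per_sec >= 0.8 * next_milepost.trigger_ops_per_sec:
        print(f"GET READY: approaching {next_milepost.trigger_ops_per_sec} "
              f"ops/s; prepare for '{next_milepost.action}'")

check_mileposts(17_000)   # -> GET READY (80% of the first trigger is 16,000)
```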