Database users of every kind have come to expect cost-based optimization for their queries.
It’s a standard feature of all traditional relational databases (RDBMS), but there hasn’t yet been a cost-based optimizer (CBO) for NoSQL document databases. That is, until now.
Cost-based optimization for Couchbase queries was released as a Developer Preview in Couchbase 6.5, and it’s now ready for general availability with Couchbase 7.0.
Let’s take a closer look at the new CBO features of Couchbase Server 7.0.
The Old Way: Rule-Based Optimization
It’s impossible to discuss cost-based optimization without referring to the previous approach to database query optimization: rule-based optimization (RBO).
With rule-based optimization, the query optimizer considers using any index that provides a path to retrieve the query result. However, when the optimizer is presented with more than one path to process the query – that is, when the query could use multiple indexes – it could end up using all indexes.
With cost-based optimization, the query optimizer calculates the cost of using each index, and then decides on the execution plan with the lowest cost.
What Is Cost-Based Optimization?
Cost-based optimization enables the database query optimizer to choose the best (lowest-cost) plan to process a query.
This matters most whenever a query has more than one potential execution path, i.e., more than one qualified index. Each query plan path is assigned a total cost for all of the operations that the query needs to perform. These costs particularly depend on the specified predicates’ selectivity. To arrive at the selectivity value of a predicate, the statistics of the predicate fields need to have been pre-computed.
Please note that the cost value is an arbitrary unit of statistics calculation: it does not equate to the resource cost or CPU usage.
Cost-Based Optimization vs Rule-Based Optimization
Cost-based query optimization always provides a better execution path when the optimizer is presented with more than one option. The computed statistics ensure that the optimizer is equipped with the knowledge about the data distribution, and subsequently the efficiency of one query plan versus another.
For instance, imagine a Couchbase Collection that contains customer information and their addresses, with an index on ix1(zipcode)
and ix2(city)
. Statistically, there are more ZIP codes than there are cities in the U.S. But that doesn’t necessarily mean that using the ix1(zipcode)
index has a lower cost (and therefore better performance) than the ix2(city)
index. Why? Because we don’t yet know if the customers in the system are evenly distributed across all U.S. cities and ZIP codes. Chances are that they’re not.
In this scenario, it’s the statistics collected by the cost-based optimizer beforehand that help it decide the lowest-cost query path at runtime.
With rule-based optimization, if a query has both city and ZIP code predicates, then the query optimizer uses both indexes with intersect scan. However, the CBO only uses one index – whichever one has the lowest cost based on the data within the Collection.
How to Enable Cost-Based Optimization in Couchbase
There are two important steps you need to take in order to use cost-based optimization in Couchbase Server 7.0:
1. Enable the CBO Feature
Currently the cost-based optimizer is enabled by default in Couchbase 7.0.
However, you can disable it with the N1QL Feature Control value in Settings->Query Settings. If you plan to use cost-based optimization, then make sure it’s enabled.
2. Collect Statistics
In addition to having the cost-based optimizer enabled in your cluster, you must also have collected statistics on the underlying fields for the indexes used by your queries. Here’s a sample query on how you’d do that:
1 |
UPDATE STATISTICS FOR<keyspace> (<index_expr1>,..) |
For example, using the travel-sample
dataset, this above query would look like this:
1 |
UPDATE STATISTICS FOR `travel-sample`._default.hotel (city); |
You can also collect the statistics according to the index name:
1 |
UPDATE STATISTICS FOR `travel-sample`._default.route.def_route_schedule_utc; |
Note that you can also disable the cost-based optimizer by either using the cbo_flag
in your API, or by deleting the statistics with UPDATE STATISTICS <keyspace> DELETE (<index_expr1>,..)
.
The CBO System Bucket
When you’ve enabled cost-based optimization and you’ve collected statistics, you’ll notice an additional Bucket in your cluster called N1QL_SYSTEM_BUCKET.N1QL_SYSTEM_SCOPE.N1QL_CBO_STAT
. Direct access to this Bucket is restricted to system administrators.
Using the UPDATE STATISTICS
command, you can either add more stats to the Bucket or delete the stats with UPDATE STATISTICS FOR <keyspace> DELETE
.
Cost-Based Optimizer in Action
Let’s take a closer look at how cost-based optimization works in Couchbase 7.0 using the
dataset in the following scenario.travel-sample
Improved Index Selection with Cost-Based Optimization
Let’s say you have defined two indexes for the
Collection.travel-sample
._default.airport
1 2 |
CREATE INDEX def_airport_city_country ON airport(city,country); CREATE INDEX def_airport_faa ON airport(faa); |
And you have a query below:
1 |
SELECT * FROM airport WHERE city = "New York" AND faa = "JFK"; |
If the above cost-based optimization weren’t used, the rules-based optimizer execution plan for the above query would be:
Note that since the query predicates are on both the city
and faa
fields, both indexes are used in the intersect scan. Learn more about intersect scans in this article: Performance Ingredients for NoSQL: Intersect Scans in N1QL.
Next, you enable cost-based optimization for the database query. Since CBO is already enabled by default, all you need to do is to collect the statistics.
1 |
UPDATE STATISTICS FOR airport(city, faa); |
After the statistics are collected, run the SELECT
query again.
You’ll see that the execution plan changes to using only the airport.faa
index. This change is because the statistics are telling the optimizer that the faa
field has a better selectivity compared to the city
field. Therefore the airport.faa
index is the more efficient index for the query to use.
A Better JOIN with Cost-Based Optimization
Let’s look at another cost-based query optimization example that illustrates how the cost-based optimizer chooses a better JOIN type between airport
and route
.
First, we create an index to track the source airport:
1 |
CREATE INDEX `def_route_sourceairport` ON `travel-sample`.`_default`.`route`(`sourceairport`) |
And then we use the query below:
1 |
SELECT DISTINCT route.destinationairport FROM airport JOIN route ON airport.faa = route.sourceairport; |
With the old rules-based optimizer, the execution plan for the above query would look like:
Note that the RBO execution plan above uses a nested loop to JOIN the airport Collection to the route Collection.
Next, you need to enable cost-based optimization for the query. Since CBO is already enabled by default, all you have to do is to collect the statistics for the field in the Collection.
1 2 |
UPDATE STATISTICS FOR route(sourceairport); UPDATE STATISTICS FOR airport(faa); |
With cost-based optimization stats available on the two underlying fields for the indexes, the optimizer switches over to using a HASH
for the Collection JOIN.
Cost-Based Optimizer Troubleshooting in Couchbase
- Cost-based optimization is enabled by default in Couchbase 7.0. However, statistics need to be present on the specific fields in your Bucket/Collections before the cost-based optimizer functions effectively.
- You can disable cost-based optimization at the cluster level using the N1QL Feature Control field by adding 16 to the value. It is a set with x10 bit.
- CBO statistics are not updated automatically by the Query Service. You should collect statistics on a regular basis, or after a significant change to your Collection content. I recommend collecting statistics after a large upload of data; otherwise, a weekly run should suffice.
- Cost-based optimization is only considered for a query if the statistics are available for all of the indexes that were being considered by the optimizer. You can collect the statistics for all the indexes on a given keyspace using the following:
12UPDATE STATISTICS FOR <collection> INDEX((SELECT RAW name from system:indexesWHERE state='online' AND keyspace_id='<collection_name>' AND bucket_id='<bucket_name>' AND scope_id='<scope_name>'))
More Resources
To learn more about the cost-based optimizer for JSON – and the Couchbase 7.0 release in general – check out the following resources:
Great article, thanks. Minor typo in last paragraph of section “A better JOIN with CBO”. The word NOT should be NOW. As in
…, the optimizer will noW switch over to using a HASH for the collection JOIN.