Introduction
IN list is commonly used in N1QL queries. Couchbase Server 6.5 introduces a couple of enhancements in handling of IN list in N1QL queries.
Handling of Large IN List
Before Couchbase Server 6.5 N1QL handles IN list evaluation at runtime in a simple fashion, by comparing with each element of the IN list in turn until a match is found. This can be inefficient if the IN list is large, i.e., there are many elements in the IN list. Couchbase Server 6.5 introduces an improvement for handling of large IN list at runtime. When the number of elements in the IN list exceeds a threshold (currently 16), it builds a hash table with all the elements in the IN list. This way each evaluation of the IN list predicate becomes a probe of the hash table instead of doing many comparisons with each element of the IN list.
“Static” IN list
For this optimization to work, the IN list must be “static”, i.e., the values of each element of the IN list cannot change during one execution of the query. You can have an IN list where all elements are constants or query parameters, for example:
1 2 3 |
document.field IN [1, 2, 3, ......, 1024] document.field IN [$1, $2, $3, ......, $128] |
The …… above is just a shorthand, you’ll need to actually fill in all the values if you want to run the query.
You can mix constants and query parameters in the same IN list:
1 |
document.field IN [1, 2, 3, 4, 5, $1, $2, $3, $4, $5, 6, 7, 8, 9, 10, ......, 512] |
You can also use a single query parameter for the entire IN list:
1 |
document.field IN $inlist |
and pass in the actual IN list as a query parameter.
This optimization of IN list evaluation will not kick in if the IN list is not “static”, for example, if any element of the IN list references any field in the document:
1 |
document.field IN [document.c1, 2, ......, 1024] |
In this case the value of the first element in the IN list changes during execution, and thus we cannot use the hash table optimization for evaluation of this IN list.
IN clause with subquery
As a special case of IN list, you can have an IN clause with a subquery:
1 |
document.field IN (SELECT RAW <expr> FROM ...... WHERE ......) |
When the RAW keyword is used and there is a single expression in the projection list, the result of the subquery is an array of values. When the subquery is not correlated, i.e., no reference to any fields in the parent query, then the new optimization of using a hash table for the IN list predicate evaluation can also be applied here. At runtime, after the subquery is executed and the result of the subquery execution is known, if the result set contains more values than the threshold, a hash table will be built with all values in the result set, and subsequent evaluation of the subquery predicate becomes a probe to the hash table. This IN list optimization can potentially be applied multiple times in case of nested subqueries:
1 2 3 4 |
document1.field1 IN (SELECT RAW <expr1> FROM ...... WHERE document2.field2 IN (SELECT RAW <expr2> FROM ...... WHERE ......) ) |
as long as both subqueries are not correlated, and the result size of each subquery exceeds the threshold.
NOT IN handling
The optimization for using a hash table for IN list evaluation is also applicable for NOT IN predicate. NOT IN predicate is evaluated the same way as IN predicate and the result boolean value is simply reversed.
Visibility of the hash table optimization for IN list evaluation
The optimization for using a hash table to evaluate large IN list is purely an internal optimization. There is little externally visible effect, except the query runs faster. Due to the use of a hash table, memory consumption also increases slightly during execution of the query. The additional memory consumption obviously depends on the size of the hash table, which in turn depends on the number of elements in the IN list.
Dynamic Index Span Generation for Query Parameter as IN List
For an IN list predicate, if an appropriate index exists, the planner can generate an index scan with multiple spans, one for each element of the IN list. For example:
1 2 3 |
CREATE INDEX ix1 ON default(c1); EXPLAIN SELECT 1 FROM default WHERE c1 IN [1, 2, 3]; |
The index scan will have 3 spans generated:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 |
"spans": [ { "exact": true, "range": [ { "high": "1", "inclusion": 3, "low": "1" } ] }, { "exact": true, "range": [ { "high": "2", "inclusion": 3, "low": "2" } ] }, { "exact": true, "range": [ { "high": "3", "inclusion": 3, "low": "3" } ] } ], |
The same is true if query parameters are used as individual elements in the IN list:
1 |
EXPLAIN SELECT 1 FROM default WHERE c1 IN [$1, $2, $3]; |
The following index spans are generated:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 |
"spans": [ { "exact": true, "range": [ { "high": "$1", "inclusion": 3, "low": "$1" } ] }, { "exact": true, "range": [ { "high": "$2", "inclusion": 3, "low": "$2" } ] }, { "exact": true, "range": [ { "high": "$3", "inclusion": 3, "low": "$3" } ] } ], |
However, if a query parameter is used as the entire IN list:
1 |
EXPLAIN SELECT 1 FROM default WHERE c1 IN $inlist; |
Then a single index span is generated:
1 2 3 4 5 6 7 8 9 10 11 12 |
"spans": [ { "exact": true, "range": [ { "high": "array_max($inlist)", "inclusion": 3, "low": "array_min($inlist)" } ] } ], |
Notice the low and high boundary of the index span is array_min($inlist) and array_max($inlist) respectively. When a query parameter is used for the entire IN list, at compile time the size of the IN list is unknown, thus it’s impossible to pre-generate index spans for each element of the IN list. Using the current index span generated the index scan needs to scan between the minimum and the maximum values of the array elements. For example, if the query parameter provided for the query execution is [1, 1000, 1000000], then everything in between 1 and 1000000 is scanned by the index. This is very inefficient and involves a lot of unnecessary work.
In Couchbase Server 6.5, the handling of such cases is improved by dynamically generating index spans at runtime. When the query is being executed, the value of the query parameter is now known, and thus the number of elements of the IN list is also known. The execution engine will now try to mimic the index spans generated when the IN list is known at compile time, by dynamically generating a separate index span for each element of the IN list. For example, if [1, 1000, 1000000] is specified as $inlist for query execution, the runtime execution engine will generate 3 index spans, similar to when [1, 1000, 1000000] is specified as a constant in the query. This allows the index scan to be much more precise and avoids scanning unnecessary regions of the index.
The explain plan for the query remains the same, still showing array_min($inlist) and array_max($inlist) as low and high bounds of the index span. A new “dynamic_in” indicator is added to the index span:
1 2 3 4 5 6 7 8 9 10 11 12 |
"spans": [ { "range": [ { "dynamic_in": true, "high": "array_max($inlist)", "inclusion": 3, "low": "array_min($inlist)" } ] } ], |
Dynamic index span generation happens at runtime. To avoid potential performance issues, there is a limit of 8192 spans when index spans are generated dynamically. If the IN list passed in (as query parameter) has more than 8192 elements, this optimization will not kick in.
Subquery handling and potential query rewrite
The above example shows the optimization of dynamically generating index spans when the IN list is a query parameter. This optimization is also applicable when the IN list is a WITH variable or a function parameter. WITH variable comes from another new feature in Couchbase Server 6.5 – Common Table Expression (see https://docs.couchbase.com/server/current/n1ql/n1ql-language-reference/with.html); function parameter comes from another new developer preview feature in Couchbase Server 6.5 – User Defined Functions (see https://docs.couchbase.com/server/6.5/n1ql/n1ql-language-reference/userfun.html).
However, when the IN clause uses a subquery, this optimization is not applicable. For example:
1 2 3 4 5 6 7 8 |
SELECT DISTINCT RAW route.airline FROM `travel-sample` AS route WHERE route.type = "route" AND route.sourceairport IN (SELECT RAW airport.faa FROM `travel-sample` AS airport WHERE airport.type = "airport" AND airport.airportname = "Seattle Tacoma Intl"); |
In this example the IN clause has a subquery on the right-hand side. If you have the following indexes defined:
1 2 |
CREATE INDEX ix_route_1 ON `travel-sample`(sourceairport, airline) WHERE type = "route"; CREATE INDEX ix_airport_1 ON `travel-sample`(airportname, faa) WHERE type = "airport"; |
Then the explain for the above query shows an index scan on ix_route_1 with the following spans:
1 2 3 4 5 6 7 8 9 10 |
"spans": [ { "range": [ { "inclusion": 0, "low": "null" } ] } ], |
This means the index scans all the non-NULL values.
It is possible to manually rewrite the above query to use a WITH clause:
1 2 3 4 5 6 7 8 |
WITH faas AS (SELECT RAW airport.faa FROM `travel-sample` AS airport WHERE airport.type="airport" AND airport.airportname = "Seattle Tacoma Intl") SELECT DISTINCT RAW route.airline FROM `travel-sample` AS route WHERE route.type = "route" AND route.sourceairport IN faas; |
Looking at the explain for this rewritten query, now the index scan on ix_route_1 has the following spans:
1 2 3 4 5 6 7 8 9 10 11 12 |
"spans": [ { "range": [ { "dynamic_in": true, "high": "array_max(`faas`)", "inclusion": 3, "low": "array_min(`faas`)" } ] } ], |
For the rewritten query now it can take advantage of the optimization to dynamically generate index spans at runtime.
This manual rewrite of the query to try to take advantage of the new optimization of dynamic index span generation is only possible when the subquery is not correlated. The manual query rewrite is not limited to a single subquery. You can do similar manual query rewrite for nested subqueries as well, as long as the subqueries involved are not correlated. For example, given the following query:
1 2 3 4 5 6 7 8 9 10 11 12 |
SELECT RAW airline.callsign FROM `travel-sample` AS airline WHERE airline.type = "airline" AND airline.iata IN (SELECT DISTINCT RAW route.airline FROM `travel-sample` AS route WHERE route.type = "route" AND route.sourceairport IN (SELECT RAW airport.faa FROM `travel-sample` AS airport WHERE airport.type = "airport" AND airport.airportname = "Seattle Tacoma Intl")); |
It can be rewritten to the following query such that it can take advantage of the optimization to dynamically generate index spans:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
WITH airlines AS ( WITH faas AS (SELECT RAW airport.faa FROM `travel-sample` AS airport WHERE airport.type = "airport" AND airport.airportname = "Seattle Tacoma Intl") SELECT DISTINCT RAW route.airline FROM `travel-sample` AS route WHERE route.type = "route" AND route.sourceairport IN faas) SELECT RAW airline.callsign FROM `travel-sample` AS airline WHERE airline.type = "airline" AND airline.iata IN airlines; |
If you have the following index defined:
1 |
CREATE INDEX ix_airline_1 ON `travel-sample`(iata, callsign) WHERE type = "airline"; |
The explain for the rewritten query shows similar index spans on index ix_airline_1:
1 2 3 4 5 6 7 8 9 10 11 12 |
"spans": [ { "range": [ { "dynamic_in": true, "high": "array_max(`airlines`)", "inclusion": 3, "low": "array_min(`airlines`)" } ] } ], |
indicating the optimization of dynamic index span generation is in play.
Summary
Couchbase Server 6.5 introduces a couple of enhancements for IN list handling. When the IN list is large, a hash table is used at runtime for IN list evaluation. When a query parameter is specified as the entire IN list, dynamic index span generation at runtime allows more efficient index scan to be performed. Both of these optimizations make the handling of IN list much more efficient. In cases where the IN clause has an uncorrelated subquery, it is possible to manually rewrite the query using common table expressions, to take advantage of the optimization of dynamic generation of index spans at runtime.
Sorry for using the blog as a communication medium.. but.. the contact form on your site doesn’t work, and the contact email bounces saying that only “members” can send emails.
Is Couchbase still alive and kicking? :/
Have you been able to talk to anyone?
Hi Denis.
I was contacted by Bingjie.
Thanks for replying here :)
Very helpful info, especially hash table and CTE for query optimization.