Bad performance with named parameters IN set / list

unhuman · May 15, 2018, 7:46pm

Trying to used named parameters in a N1QL query and that seems to work fine for equals values, however, our query has IN and a List is provided, the query plan changes and performance is awful.

our query:

SELECT count(1) FROM `bucket` WHERE `user` is not null" +
            " AND `user`.`type` = $accountType" +
            " AND `user`.`id` = $accountId" +
            " AND split(meta().id, '::')[0] in ["e1", "e2", "e3", "e4"];

or (with replacement for IN)

SELECT count(1) FROM `bucket` WHERE `user` is not null" +
            " AND `user`.`type` = $accountType" +
            " AND `user`.`id` = $accountId" +
            " AND split(meta().id, '::')[0] in $environments";

Setting up the $accountType and $accountId work just fine, but $environments is a List of String. When that comes into play, the query plan changes and performance is very bad (timeout when using a direct string instead works in 100ms).

The String based query plan is like this:

{
    "plan":
    {
        "~children": [
        {
            "scan":
            {
                "keyspace": "bucket",
                "spans": [
                {
                    "Range":
                    {
                        "High": ["\"e1\"", "$accountType", "successor($accountId)"],
                        "Low": ["\"e1\"", "$accountType", "$accountId"],
                        "Inclusion": 1
                    }
                },
                {
                    "Range":
                    {
                        "High": ["\"e2\"", "$accountType", "successor($accountId)"],
                        "Low": ["\"e2\"", "$accountType", "$accountId"],
                        "Inclusion": 1
                    }
                },
                {
                    "Range":
                    {
                        "High": ["\"e3\"", "$accountType", "successor($accountId)"],
                        "Low": ["\"e3\"", "$accountType", "$accountId"],
                        "Inclusion": 1
                    }
                },
                {
                    "Range":
                    {
                        "High": ["\"e4\"", "$accountType", "successor($accountId)"],
                        "Low": ["\"e4\"", "$accountType", "$accountId"],
                        "Inclusion": 1
                    }
                }],
                "#operator": "IndexScan",
                "using": "gsi",
                "namespace": "default",
                "index": "user_idx",
                "index_id": "619388af1e312812",
                "covers": ["cover ((split((meta(`bucket`).`id`), \"::\")[0]))", "cover (((`bucket`.`user`).`type`))", "cover (((`bucket`.`user`).`id`))", "cover (((`bucket`.`user`).`ttlInSec`))", "cover ((meta(`bucket`).`id`))"],
                "filter_covers":
                {
                    "cover (((`bucket`.`user`) is not null))": true
                }
            },
            "#operator": "DistinctScan"
        },
        {
            "~child":
            {
                "~children": [
                {
                    "#operator": "Filter",
                    "condition": "(((cover (((`bucket`.`user`) is not null)) and (cover (((`bucket`.`user`).`type`)) = $accountType)) and (cover (((`bucket`.`user`).`id`)) = $accountId)) and (cover ((split((meta(`bucket`).`id`), \"::\")[0])) in [\"e1\", \"e2\", \"e3\", \"e4\"]))"
                },
                {
                    "group_keys": [],
                    "#operator": "InitialGroup",
                    "aggregates": ["count(1)"]
                }],
                "#operator": "Sequence"
            },
            "#operator": "Parallel"
        },
        {
            "group_keys": [],
            "#operator": "IntermediateGroup",
            "aggregates": ["count(1)"]
        },
        {
            "group_keys": [],
            "#operator": "FinalGroup",
            "aggregates": ["count(1)"]
        },
        {
            "~child":
            {
                "~children": [
                {
                    "#operator": "InitialProject",
                    "result_terms": [
                    {
                        "as": "count",
                        "expr": "count(1)"
                    }]
                },
                {
                    "#operator": "FinalProject"
                }],
                "#operator": "Sequence"
            },
            "#operator": "Parallel"
        }],
        "#operator": "Sequence"
    },
    "text": "SELECT count(1) as count FROM `bucket` WHERE `user` is not null AND `user`.`type` = $accountType AND `user`.`id` = $accountId AND split(meta().id, '::')[0] in ['e1','e2','e3','e4']"
}

and the named parameter based list returns:

{
    "plan":
    {
        "~children": [
        {
            "keyspace": "bucket",
            "spans": [
            {
                "Range":
                {
                    "Low": ["null", "$accountType", "$accountId"],
                    "Inclusion": 0
                }
            }],
            "#operator": "IndexScan",
            "using": "gsi",
            "namespace": "default",
            "index": "user_idx",
            "index_id": "619388af1e312812",
            "covers": ["cover ((split((meta(`bucket`).`id`), \"::\")[0]))", "cover (((`bucket`.`user`).`type`))", "cover (((`bucket`.`user`).`id`))", "cover (((`bucket`.`user`).`ttlInSec`))", "cover ((meta(`bucket`).`id`))"],
            "filter_covers":
            {
                "cover (((`bucket`.`user`) is not null))": true
            }
        },
        {
            "~child":
            {
                "~children": [
                {
                    "#operator": "Filter",
                    "condition": "(((cover (((`bucket`.`user`) is not null)) and (cover (((`bucket`.`user`).`type`)) = $accountType)) and (cover (((`bucket`.`user`).`id`)) = $accountId)) and (cover ((split((meta(`bucket`).`id`), \"::\")[0])) in $environments))"
                },
                {
                    "group_keys": [],
                    "#operator": "InitialGroup",
                    "aggregates": ["count(1)"]
                }],
                "#operator": "Sequence"
            },
            "#operator": "Parallel"
        },
        {
            "group_keys": [],
            "#operator": "IntermediateGroup",
            "aggregates": ["count(1)"]
        },
        {
            "group_keys": [],
            "#operator": "FinalGroup",
            "aggregates": ["count(1)"]
        },
        {
            "~child":
            {
                "~children": [
                {
                    "#operator": "InitialProject",
                    "result_terms": [
                    {
                        "as": "count",
                        "expr": "count(1)"
                    }]
                },
                {
                    "#operator": "FinalProject"
                }],
                "#operator": "Sequence"
            },
            "#operator": "Parallel"
        }],
        "#operator": "Sequence"
    },
    "text": "SELECT count(1) as count FROM `bucket` WHERE `user` is not null AND `user`.`type` = $accountType AND `user`.`id` = $accountId AND split(meta().id, '::')[0] in $environments"
}

vsr1 · May 15, 2018, 8:05pm

IndexScans are based on range scans (i.e low, high).
The right side of the IN clause requires array (i.e key IN $environments ) if it is named/query parameter the value will not know at prepare time (because query plan is generated at prepare time). So query plan can’t convert the array values into multiple range values of OR clause. It request indexScan on whole range of that key (some times BETWEEN ARRAY_MIN($enviroments) AND ARRAY_MAX($environments) ) reapply predicates in query engine. This causes lot of false positives and impacts performance.

At present you have one of the following options.

If you have finite array elements try individual element as query parameters
ex: key IN [ $qp1, $qp2, $qp3]
adhoc=true
In example above you can use equality and repeat query for each IN element and add the result in client may perform better because If it uses IndexCountScan
Also if user.type, user.id always equality predicates move those leading , followed by split(meta().id, ‘::’)[0] in inedex keys. This makes named parameters explain will have high range

unhuman · May 15, 2018, 8:28pm

Unfortunately, no. It varies. Although, we could generate the query to define each sub parameter.
adhoc(true) was actually our query mode. Didn’t work.
Yep, that’s similar to IN and would be similar to 1.
That could help, but essentially would require a scan of the results… That winds up being slow.

I think it may be worth trying 1.

Thanks - H

vsr1 · May 15, 2018, 8:34pm

You should try with 3 & 4

CREATE INDEX ix1 ON `bucket`(`user`.type, `user`.accountId, split(meta().id, '::')[0]);

SELECT count(1) FROM `bucket` WHERE  `user`.`type` = $accountType  AND `user`.`id` = $accountId AND split(meta().id, '::')[0]  = $e1;

In the client
        cnt = 0;
        for v IN environments
             set  e1 named parameter to v
             execute query
             cnt = cnt + result from query
       end for

unhuman · May 15, 2018, 8:37pm

Oh making multiple queries? Later in our code we actually use the count to do other work (querying data again)… Best to keep it all together.

I’d prefer to do one of the following:

SELECT count(1) FROM `bucket` WHERE  `user`.`type` = $accountType  AND `user`.`id` = $accountId AND split(meta().id, '::')[0]  IN [ $e1, $e2, $e3, ....];

or

SELECT count(1) FROM `bucket` WHERE  `user`.`type` = $accountType  AND `user`.`id` = $accountId AND 
(split(meta().id, '::')[0]  = $e1 or split(meta().id, '::')[0]  = $e2 or split(meta().id, '::')[0]  = $e3 ....);

vsr1 · May 15, 2018, 8:43pm

In that case change the index suggested and try your original query you should see some improvement.

"keyspace": "bucket",
            "spans": [
            {
                "Range":
                {
                    "Low": ["null", "$accountType", "$accountId"],
                    "Inclusion": 0
                }
            }],

As you see above there is no high value (if any key high missing, following keys high will not pushed to indexer). example accountId=10 it gets all accountIds>=10 and applied filter. That will change with new index.

FYI: 5.0+ this has been improved

unhuman · May 15, 2018, 9:37pm

I changed it to :

SELECT count(1) FROM `bucket` WHERE  `user`.`type` = $accountType  AND `user`.`id` = $accountId AND split(meta().id, '::')[0]  IN [ $e1, $e2, $e3, ....];

And the existing index is fine. The query plan is very similar to the first:

{
    "plan":
    {
        "~children": [
        {
            "scan":
            {
                "keyspace": "bucket",
                "spans": [
                {
                    "Range":
                    {
                        "High": ["$e0", "$accountType", "successor($accountId)"],
                        "Low": ["$e0", "$accountType", "$accountId"],
                        "Inclusion": 1
                    }
                },
                {
                    "Range":
                    {
                        "High": ["$e1", "$accountType", "successor($accountId)"],
                        "Low": ["$e1", "$accountType", "$accountId"],
                        "Inclusion": 1
                    }
                },
                {
                    "Range":
                    {
                        "High": ["$e2", "$accountType", "successor($accountId)"],
                        "Low": ["$e2", "$accountType", "$accountId"],
                        "Inclusion": 1
                    }
                },
                {
                    "Range":
                    {
                        "High": ["$e3", "$accountType", "successor($accountId)"],
                        "Low": ["$e3", "$accountType", "$accountId"],
                        "Inclusion": 1
                    }
                }],
                "#operator": "IndexScan",
                "using": "gsi",
                "namespace": "default",
                "index": "user_idx",
                "index_id": "619388af1e312812",
                "covers": ["cover ((split((meta(`bucket`).`id`), \"::\")[0]))", "cover (((`bucket`.`user`).`type`))", "cover (((`bucket`.`user`).`id`))", "cover (((`bucket`.`user`).`ttlInSec`))", "cover ((meta(`bucket`).`id`))"],
                "filter_covers":
                {
                    "cover (((`bucket`.`user`) is not null))": true
                }
            },
            "#operator": "DistinctScan"
        },
        {
            "~child":
            {
                "~children": [
                {
                    "#operator": "Filter",
                    "condition": "(((cover (((`bucket`.`user`) is not null)) and (cover (((`bucket`.`user`).`type`)) = $accountType)) and (cover (((`bucket`.`user`).`id`)) = $accountId)) and (cover ((split((meta(`bucket`).`id`), \"::\")[0])) in [$e0, $e1, $e2, $e3]))"
                },
                {
                    "group_keys": [],
                    "#operator": "InitialGroup",
                    "aggregates": ["count(1)"]
                }],
                "#operator": "Sequence"
            },
            "#operator": "Parallel"
        },
        {
            "group_keys": [],
            "#operator": "IntermediateGroup",
            "aggregates": ["count(1)"]
        },
        {
            "group_keys": [],
            "#operator": "FinalGroup",
            "aggregates": ["count(1)"]
        },
        {
            "~child":
            {
                "~children": [
                {
                    "#operator": "InitialProject",
                    "result_terms": [
                    {
                        "as": "count",
                        "expr": "count(1)"
                    }]
                },
                {
                    "#operator": "FinalProject"
                }],
                "#operator": "Sequence"
            },
            "#operator": "Parallel"
        }],
        "#operator": "Sequence"
    },
    "text": "SELECT count(1) as count FROM `bucket` WHERE `user` is not null AND `user`.`type` = $accountType AND `user`.`id` = $accountId AND split(meta().id, '::')[0] in [$e0,$e1,$e2,$e3]"
}

Thanks so much for the suggestion! Big help to have less bad code.

unhuman · May 21, 2018, 5:38pm

@vsr1 So, following up a bit. Although the query plan looks good, sometimes we are seeing these queries time out, still.

I found that manually querying IN […] within our application, we see timeouts. So, I take the same query to the Admin UI, and query - takes a long time. Then, I queried just for = one of the items in the list (for all the items). Those were fast. Then, queried the original query, and it’s fast…

Now, we could break up our query into 4 queries (or X as the case may be) as you mentioned earlier, but this seems pretty weird…

All of this is currently anecdotal - I need to be able to find / duplicate / etc the case.

-H

vsr1 · May 21, 2018, 7:10pm

@unhuman, Is the number of IN elements are more when timeout, what about load, you can increase the timeout. In 5.x lot of these are improved https://developer.couchbase.com/documentation/server/current/performance/index_pushdowns.html

Topic		Replies	Views
Named Parameters in the Web Console Couchbase Server query , n1ql	31	3928	January 23, 2024
How to create N1QL query with array parameter .NET SDK query , n1ql , dot-net	3	2623	September 5, 2023
N1QL Index performance SQL++	3	574	May 14, 2019
ParameterizedN1qlQuery with IN statement SQL++	17	5755	June 27, 2016
N1ql Query Named Params are not being applied on PHP SDK? PHP SDK query , n1ql	1	1829	September 12, 2016

Bad performance with named parameters IN set / list

Related topics