Array with DISTINCT is slow

revathi · April 5, 2018, 7:30pm

Hello,

Document:
{
id: “123”
type: “report”
data: [
{
“id”: 0,
“required”: true,
“value”: “10000”
},
{
“id”: 1,
“required”: true,
“value”: “florida”
},
{
“group”: “”,
“id”: 2,
“required”: true,
“value”: “07/01/2017”
},
…
{
“group”: “”,
“id”: 25,
“required”: true,
“value”: “45”
},
]

Index:
CREATE INDEX doc_idx ON default((distinct (array (s.value) for s in data when ((s.id) = 17) end)))

Query:
SELECT DISTINCT (ARRAY i.value FOR i IN data WHEN i.id=17 END) FROM default Use index(doc_idx) WHERE type= ‘report’

Size: 100K documents of type=‘report’

Output:
[
{
“$1”: [
“45”
]
},
{
“$1”: [
“50”
]
}
]
When I try to run this query it almost takes 15 to 18 seconds to give 2 distinct values. I don’t see any difference with and without array index. This takes really long time. I don’t know whether I am missing something.

vsr1 · April 5, 2018, 8:40pm

If id is constant all the time you can use following query and index. NOTE: variable in the index and UNNEST alias must be same (i.e. s)

CREATE INDEX doc_idx ON default(ALL ARRAY s.value FOR s IN data WHEN s.id = 17 END) WHERE type = "report";
SELECT s.value FROM default d UNNEST d.data AS s WHERE d.type = "report" AND s.id = 17 AND s.value IS NOT NULL;

If id is not consant you can try the following.

CREATE INDEX doc_idx ON default(DISTINCT ARRAY s.id FOR s IN data END) WHERE type = “report”;
SELECT s.value FROM default AS d UNNEST d.data AS s WHERE d.type = “report” AND s.id = 17 ;

SELECT ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM default AS d WHERE d.type = “report” AND ANY v IN d.data SATISFIES v.id = 17 END;

revathi · April 5, 2018, 10:10pm

Yes id is constant. Data array size will vary based on type. Say (type=‘report’, then data array size is 25 , type=‘xxx’ data size is 18)

FYI: I have 100,000 documents of type ‘report’ (each document will have data array of size 25)

I tried both possibilities, still taking time with distinct.

Id is constant
CREATE INDEX doc_idx ON default(ALL ARRAY s.value FOR s IN data WHEN s.id = 17 END) WHERE type = “report”;

23 seconds
SELECT distinct s.value FROM default d UNNEST d.data AS s WHERE
d.type= ‘report’ AND s.id = 17 AND s.value IS NOT NULL;

If id is not constant
CREATE INDEX doc_idx ON default(distinct array (s.value) for s in data end)
WHERE type=‘report’

25 seconds
SELECT ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM default AS d WHERE d.type = “report” AND ANY v IN d.data SATISFIES v.id = 17 END;

59 milli seconds
SELECT ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM default AS d WHERE d.type = “report” AND ANY v IN d.data SATISFIES v.id = 17 END LIMIT 10;

21 seconds
SELECT DISTINCT ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM default AS d WHERE d.type = “report” AND ANY v IN d.data SATISFIES v.id = 17 END;

clinton1ql · April 5, 2018, 10:12pm

What version of Couchbase and can you post some EXPLAIN results.

vsr1 · April 5, 2018, 10:56pm

LIMIT query returns past. That means your query result size might be too high and that is taking time.
DISTINCT on projection should not take time and it showing your output.

25 seconds
SELECT ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM default AS d WHERE d.type = “report” AND ANY v IN d.data SATISFIES v.id = 17 END;

21 seconds
SELECT DISTINCT ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM default AS d WHERE d.type = “report” AND ANY v IN d.data SATISFIES v.id = 17 END;

revathi · April 6, 2018, 3:48pm

Version 5.1.0

Indexes:
CREATE INDEX doc_id_idx ON testdoc(type,id,data)
CREATE INDEX doc_idx ON testdoc((distinct (array (s.value) for s in data end))) WHERE (type = “report”)

Query:
SELECT DISTINCT ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM testdoc AS d USE INDEX(doc_idx)
WHERE type= ‘report’ AND ANY v IN d.data SATISFIES v.id = 17 END;

Explain:
{
“plan”: {
“#operator”: “Sequence”,
“~children”: [
{
“#operator”: “IndexScan2”,
“covers”: [
“cover ((d.type))”,
“cover ((d.id))”,
“cover ((d.data))”,
“cover ((meta(d).id))”
],
“index”: “doc_id_idx”,
“index_id”: “c3c4309d1cd1859d”,
“index_projection”: {
“entry_keys”: [
0,
2
]
},
“keyspace”: “testdoc”,
“namespace”: “default”,
“spans”: [
{
“exact”: true,
“range”: [
{
“high”: ““report””,
“inclusion”: 3,
“low”: ““report””
}
]
}
],
“using”: “gsi”
},
{
“#operator”: “Parallel”,
“~child”: {
“#operator”: “Sequence”,
“~children”: [
{
“#operator”: “Filter”,
“condition”: “((cover ((d.type)) = “report”) and any v in cover ((d.data)) satisfies ((v.id) = 17) end)”
},
{
“#operator”: “InitialProject”,
“distinct”: true,
“result_terms”: [
{
“expr”: “array (s.value) for s in cover ((d.data)) when ((s.id) = 17) end”
}
]
},
{
“#operator”: “Distinct”
},
{
“#operator”: “FinalProject”
}
]
}
},
{
“#operator”: “Distinct”
}
]
},
“text”: “SELECT DISTINCT ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM testdoc AS d USE INDEX(doc_idx)\nWHERE type= ‘report’ AND ANY v IN d.data SATISFIES v.id = 17 END;”
}

vsr1 · April 6, 2018, 4:31pm

The plan looks good. number of qualified items might be too high. That is why it taking time.

revathi · April 9, 2018, 2:47pm

oh ok. Thanks for the response.

Yes for SUM() and DISTINCT functions with high volume of document, its taking more time. For just 100,000 doucments itself its taking around 25 seconds. We are having atmost 1 million documents.

So you are saying is couchbase db won’t be good for high volume data(SUM, DISTINCT) operations. Do you recommend any other solutions?

clinton1ql · April 9, 2018, 5:01pm

Depending on data accuracy you might be able to get away with using the older views to get the data for you.

revathi · April 9, 2018, 6:36pm

Sorry for asking this again, we are new to Couchbase we are working for larger dataset which can store more than 1 million rows per document and we are using Excel like filter to show the data based on the user filtered column(s), please help me to understand what is the “older views” and how i can implement for my data set?

vsr1 · April 9, 2018, 6:40pm

Do you mean more than million in array for each document? You should think about data model https://developer.couchbase.com/documentation/server/5.1/data-modeling/intro-data-modeling.html

revathi · April 9, 2018, 6:44pm

No. more than 1 million document in a bucket and each document contains array size of 25, which contains the following data.

{
“documentId”:“1234”
“data”: [{
“group”: “”,
“id”: 0,
“required”: true,
“value”: “07/01/2017”
},
{
“group”: “”,
“id”: 1,
“required”: true,
“value”: “07/01/2017”
},
…
]
}

vsr1 · April 9, 2018, 6:51pm

Please post your SUM, DISTINCt query.
array has 25 then the multiplication factor will be 25.
What is total qualification items fro your query.

revathi · April 9, 2018, 7:03pm

Sample Doc:
{
“_key”: “b6dce680-7543-4de4-a719-ba97bc025cf50”,
“data”: [
{
“group”: “”,
“id”: 0,
“originalValue”: “10000”,
“required”: true,
“type”: “alphanumeric”,
“value”: “10000”
},
{
“group”: “”,
“id”: 1,
“originalValue”: “Option”,
“required”: true,
“type”: “alphabet”,
“value”: “Option”
},
{
“group”: “”,
“id”: 2,
“originalValue”: “2017-07-01”,
“required”: true,
“type”: “date(MM/dd/yyyy)”,
“value”: “07/01/2017”
},
{
“group”: “”,
“id”: 3,
“originalValue”: “2020-07-01”,
“required”: true,
“type”: “date(MM/dd/yyyy)”,
“value”: “07/01/2020”
},
{
“group”: “”,
“id”: 4,
“originalValue”: “3042”,
“required”: true,
“type”: “int”,
“value”: “3042”
},
{
“group”: “”,
“id”: 5,
“originalValue”: “20.00”,
“required”: true,
“type”: “currency”,
“value”: “$20.00”
},
{
“group”: “”,
“id”: 6,
“originalValue”: “20”,
“required”: true,
“type”: “int”,
“value”: “20”
},
{
“group”: “rsu”,
“id”: 7,
“originalValue”: “1.29”,
“required”: true,
“type”: “percent”,
“value”: “1.29%”
},
{
“group”: “rsu”,
“id”: 8,
“originalValue”: “17.30”,
“required”: true,
“type”: “percent”,
“value”: “17.30%”
},
{
“group”: “rsu”,
“id”: 9,
“originalValue”: “1.10”,
“required”: true,
“type”: “percent”,
“value”: “1.10%”
},
{
“group”: “rsu”,
“id”: 10,
“originalValue”: “3.00”,
“required”: true,
“type”: “double”,
“value”: “3.00”
},
{
“group”: “”,
“id”: 11,
“originalValue”: “6.00”,
“required”: true,
“type”: “percent”,
“value”: “6.00%”
},
{
“group”: “”,
“id”: 12,
“originalValue”: “47.07”,
“required”: true,
“type”: “currency”,
“value”: “$47.07”
},
{
“group”: “”,
“id”: 13,
“originalValue”: “44.24”,
“required”: true,
“type”: “currency”,
“value”: “$44.24”
},
{
“group”: “”,
“id”: 14,
“originalValue”: “415”,
“required”: true,
“type”: “int”,
“value”: “415”
},
{
“group”: “”,
“id”: 15,
“originalValue”: “7.39”,
“required”: true,
“type”: “currency”,
“value”: “$7.39”
},
{
“group”: “”,
“id”: 16,
“originalValue”: “54.41”,
“required”: true,
“type”: “currency”,
“value”: “$54.41”
},
{
“group”: “”,
“id”: 17,
“originalValue”: “47.06882441775697”,
“required”: false,
“type”: “currency”,
“value”: “$47.07”
},
{
“group”: “”,
“id”: 18,
“originalValue”: “-0.0011755822430288276”,
“required”: false,
“type”: “currency”,
“value”: “$0.00”
},
{
“group”: “”,
“id”: 19,
“originalValue”: “44.24469495269155”,
“required”: false,
“type”: “currency”,
“value”: “$44.24”
},
{
“group”: “”,
“id”: 20,
“originalValue”: “0.00469495269155118”,
“required”: false,
“type”: “currency”,
“value”: “$0.00”
},
{
“group”: “”,
“id”: 21,
“originalValue”: “7.387572241188462”,
“required”: false,
“type”: “currency”,
“value”: “$7.39”
},
{
“group”: “”,
“id”: 22,
“originalValue”: “-0.0024277588115380055”,
“required”: false,
“type”: “currency”,
“value”: “$0.00”
},
{
“group”: “”,
“id”: 23,
“originalValue”: “36.857122711503095”,
“required”: false,
“type”: “currency”,
“value”: “$36.86”
},
{
“group”: “”,
“id”: 24,
“originalValue”: “-17.5528772884969”,
“required”: false,
“type”: “currency”,
“value”: “($17.55)”
}
],
“id”: 0,
“issue”: false,
“procedureId”: “b6dce680-7543-4de4-a719-ba97bc025cf5”
}

Query:
SELECT distinct ARRAY s.value FOR s IN d.data WHEN s.id = 17 END FROM procedurecontent AS d
WHERE procedureId= ‘b6dce680-7543-4de4-a719-ba97bc025cf5’ AND ANY v IN d.data SATISFIES v.id = 17 END

Index:
CREATE INDEX proc_value_idx ON procedurecontent((distinct (array (s.value) for s in data end))) WHERE (procedureId = “b6dce680-7543-4de4-a719-ba97bc025cf5”)

vsr1 · April 9, 2018, 7:11pm

Please check explain the query may not using the index
Is procedureId , v.id are constant across all queries or it varies. For given document v.id is unique

Assuming procuderId, v.Id can be changed from query to query and v.id is unique in the array
The following is best option

CREATE INDEX proc_value_idx ON procedurecontent(procedureId, DISTINCT ARRAY s.id FOR s IN data END);

SELECT DISTINCT FIRST s.value FOR s IN d.data WHEN s.id = 17 END FROM procedurecontent AS d
WHERE d.procedureId = "b6dce680-7543-4de4-a719-ba97bc025cf5" AND ANY v IN d.data SATISFIES v.id = 17 END;

revathi · April 16, 2018, 6:26pm

Sorry for the late reply. I tried this query too, but no much difference in query execution time. As discussed before I think since we have large amount of data, its taking some time. Thanks for the response

Topic		Replies	Views
I have a problem with ARRAY INDEXES Couchbase Server	24	2956	September 29, 2016
Difference between Distinct and All array indexes re: Performance SQL++	1	2761	August 6, 2017
Array Distinct N1Ql SQL++	1	572	November 2, 2021
Optimising count query SQL++ query	2	623	May 2, 2019
Need help for the array query SQL++ array-query	10	1235	January 19, 2021

Array with DISTINCT is slow

Related topics