Array indexing / Search Performance Question

Hi,
We are in our final testing for qualifying Couchbase on a large customer project. At this stage we have a performance question. Perhaps are we doing it wrong.

Our model will look like this:

{
g: “global type of records”
l: “local filter”
a: [ “array filter values” ]
}

For our test we generated 500000 records:
[{“g”:“e4bae564-ca56-42de-968b-fcc78a8f604e”,“l”:“66502cf3-e952-4bf4-9355-a16c9e343d3a”,“a”:[“66502cf3-e952-4bf4-9355-a16c9e343d3a”]}

The g is always the same, will have the same “l” and “a” filters values, the value is new every 100 records. The only difference between a and l is the type, one is an array of string, one is a string.
So at the end we have 500000 records with same g, and 5000 groups of 100 records with the same l and a value.
The index is:
CREATE INDEX idx_tstp ON TSTP(g,l,(distinct (array v for v in a end)),a) WHERE (g = “e4bae564-ca56-42de-968b-fcc78a8f604e”)

The question we have is the difference of performance between the two requests:
SELECT g FROM TSTP where l = “66502cf3-e952-4bf4-9355-a16c9e343d3a” AND g=“e4bae564-ca56-42de-968b-fcc78a8f604e” LIMIT 5
– and this one –
SELECT g FROM TSTP where ANY v in a SATISFIES v = “66502cf3-e952-4bf4-9355-a16c9e343d3a” END AND g=“e4bae564-ca56-42de-968b-fcc78a8f604e” LIMIT 5

I would have expect, because of the indexation, to have more or less the same performance, or we are going from 5ms for the first one to 300ms for the second one.

The issue is this kind of request will be quite done often in our solution. And we have to ensure some API response time below 1s.

Is there any way to improve the search of string values in an array ?

Test done on: couchbase 6.6.1, 1 Node, 16 Cores, 64Gb RAM, bucket 4GB Ram, Index 4GB Ram

Best regards,
David.

CREATE INDEX `ix1` ON `TSTP` ( `g` , `l`) WHERE ( `g` = “e4bae564-ca56-42de-968b-fcc78a8f604e”);
SELECT g 
FROM TSTP 
WHERE l = “66502cf3-e952-4bf4-9355-a16c9e343d3a” 
            AND g=“e4bae564-ca56-42de-968b-fcc78a8f604e” LIMIT 5;

Above query use ix1 index

CREATE INDEX `ix2` ON `TSTP` ( `g` ,  DISTINCT a) WHERE ( `g` = “e4bae564-ca56-42de-968b-fcc78a8f604e”);

SELECT g 
FROM TSTP 
where ANY v in a SATISFIES v = “66502cf3-e952-4bf4-9355-a16c9e343d3a” END 
          AND g=“e4bae564-ca56-42de-968b-fcc78a8f604e” LIMIT 5

Above query use ix2 index.

Array index flatten the array and creates new entry in the index for every element as if it is different document. In that process if the index keys has non array index key values those repeats. If you have 50 array elements for single document., index will have 50 entries. In your case idx_tstp 3rd index key is a i.e. a is repeated every index row. That increases index size. Note: Index similar like b-tree structure
As B-tree index the items in the index sorted in first key, when duplicates sorted in second key,…
In your array index query you did not give l as predicate that means index needs to scan all l values. When you skip key it needs scan all the possible values because it is all over the index. That is reason it takes time.

https://index-advisor.couchbase.com/indexadvisor/#1

Thanks a lot, this has helped me by a lot.
Perhaps this will be good, to have on your documentation a way to explain this.

So to resume:

  • If you index an array, try to make it’s own index
  • index fields orders matters by a lot. It’s seem the best results i have is:
    the fields involved in the where of the index in first
    the fields that are used for ordering for the query in second
    the fields that will be used for filtering the data by a lot (the most significant one)
  • coverage fields in index is not always good
  • interscan of index is better than trying to have one index for all.

The skipping keys is a little bit tricky, the balance between plenty of index for all kind of search, and less index covering more search case. But this will be more a design choice.

The only issue i still have is with index having a filter on a field that must exist:
CREATE INDEX i ON TSTP where a IS VALUED
Seems to not work if i query with WHERE a IS VALUED.

It’s working if i do
CREATE INDEX i ON TSTP where a IS NOT NULL and a IS NOT MISSING
Seems to not work if i query with WHERE a IS NOT NULL and a IS NOT MISSING
Thanks again.

David.