Advice on creating an index for good performance

Hello. I’ve been using Couchbase for a long time, but now I have a problem that I can’t solve efficiently with an index.
I need to output data for pagination (using LIMIT/OFFSET), taking user preferences into account.
Suppose the data has the following structure:

`{"someTs": 1642454378, "id": 123456, "subjectId": 654321, "category": [1, 2]}`
`{"someTs": 1642454377, "id": 123457, "subjectId": 741852, "category": [19]}`

The subjectId and id can be almost any number, but users should be able to exclude certain subjectIds that they do not want to see in search results. category contains one or more numbers from 0 to 19.
The only N1QL query I can come up with is:

SELECT *
FROM MyData
WHERE (ANY cat IN [1, 18] SATISFIES cat IN category END)
      AND subjectId NOT IN [123,456,789,...,999999]
      AND id NOT IN [741,852,963,...,987654]
ORDER BY someTs DESC
LIMIT 20
OFFSET 0

(then OFFSET 20, 40, 60, and so on)
But the processing time of such a query increases significantly as the number of elements in the arrays grows. Is there any way to do this more efficiently? I tried replacing IN/NOT IN with multiple ANDed comparisons, but it didn’t give any performance gain.
I’ve seen suggestions with UNION on the forum, but as far as I understand it cannot be combined with OFFSET and with sorting by time across all documents.
Any help would be appreciated :)

NOT IN is not an easy predicate for an index scan (it is converted to a range).

The best option will be to use keyset pagination with the following:

CREATE INDEX ix1 ON MyData(someTs DESC, META().id DESC, DISTINCT category, id, subjectId);

Start with $ts = "" and $key = ""; on each subsequent iteration, pass the last values from the previous page:

SELECT d.* , META(d).id
FROM MyData AS d
WHERE ANY cat IN category SATISFIES cat IN [1, 18] END
      AND subjectId NOT IN [123,456,789,...,999999]
      AND id NOT IN [741,852,963,...,987654]
      AND someTs <= $ts
      AND META(d).id < $key
ORDER BY someTs DESC, META(d).id DESC
LIMIT 20 ;

If required, use a covered query to produce the document keys and do the Fetch in the parent query.
If the NOT IN values are constants across all users, try them as an index condition with NOT ARRAY_CONTAINS([…], id).

Assuming the NOT IN (excluded) data is small, you can pinpoint the seek position directly.
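The cursor mechanics can be sketched in plain Python (an in-memory model with made-up documents, not the Couchbase SDK; the page size and data are illustrative). META().id acts as a tie-breaker only when someTs values are equal:

```python
# Minimal in-memory sketch of keyset pagination: instead of OFFSET, each page
# starts from the (someTs, document key) values of the last row of the
# previous page. All data below is made up for illustration.

docs = [{"key": f"crd::{i}", "someTs": 1000 - i, "category": [i % 20]}
        for i in range(100)]

# Mirror the ix1 order: someTs DESC, then key DESC (two stable sorts).
ordered = sorted(docs, key=lambda d: d["key"], reverse=True)
ordered = sorted(ordered, key=lambda d: d["someTs"], reverse=True)

def page(last_ts=None, last_key=None, limit=4, cats=(1, 18)):
    """Return the next page strictly after the (last_ts, last_key) cursor."""
    rows = []
    for d in ordered:
        after_cursor = last_ts is None or (
            d["someTs"] < last_ts
            or (d["someTs"] == last_ts and d["key"] < last_key))
        if after_cursor and any(c in cats for c in d["category"]):
            rows.append(d)
            if len(rows) == limit:
                break
    return rows

p1 = page()
p2 = page(p1[-1]["someTs"], p1[-1]["key"])  # cursor = last row of page 1
```

Each page costs the same regardless of how deep you are in the result set, which is the point of replacing OFFSET with a start value.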


Thank you very much for your help. I didn’t know of a way to eliminate OFFSET, but unfortunately it didn’t help me.

The excluded subjectId and id values and the selected categories are unique to each user (users themselves determine which categories they want to see and which documents should be excluded from the results). Performance still drops significantly even with a small increase in the number of elements in the arrays. The category array can have no more than 20 values, but the array of excluded documents can, in theory, be any size the user wants.
I ran each query three times and report the average execution time:

SELECT d.*, META(d).id
FROM MyData AS d
WHERE ANY cat IN category SATISFIES cat IN [1,18] END
      AND subjectId NOT IN [1,2,3,4,5,6,7,8,9,10]
      AND id NOT IN [100,200,300,400,500,600,700,800,900,1000]
      AND someTs IS NOT MISSING
      AND META(d).id IS NOT MISSING
ORDER BY someTs DESC, META(d).id DESC
LIMIT 20;
//511ms

//Add 2 elements to each array
SELECT d.*, META(d).id
FROM MyData AS d
WHERE ANY cat IN category SATISFIES cat IN [1,3,4,18] END
      AND subjectId NOT IN [1,2,3,4,5,6,7,8,9,10,11,12]
      AND id NOT IN [100,200,300,400,500,600,700,800,900,1000,1100,1200]
      AND someTs IS NOT MISSING
      AND META(d).id IS NOT MISSING
ORDER BY someTs DESC, META(d).id DESC
LIMIT 20;
//1.27s

//Add 4 more elements to each array
SELECT d.*, META(d).id
FROM MyData AS d
WHERE ANY cat IN category SATISFIES cat IN [1,3,4,5,6,7,8,18] END
      AND subjectId NOT IN [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
      AND id NOT IN [100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600]
      AND someTs IS NOT MISSING
      AND META(d).id IS NOT MISSING
ORDER BY someTs DESC, META(d).id DESC
LIMIT 20;
//3.9s

The new index ix1 contains 10,340 items at a 100% resident ratio; the machine has a 6-core/12-thread CPU with 15 of 32 GB RAM in use, running Couchbase 6.6.
When I ran the last query I noticed that the CPU load went from 6% to 22%, and the beginning of the plan output is as follows (64k lines in total):

{
  "#operator": "Sequence",
  "#stats": {
    "#phaseSwitches": 1
  },
  "~children": [
    {
      "#operator": "Authorize",
      "#stats": {
        "#phaseSwitches": 3,
        "servTime": "975.6µs"
      },
      "privileges": {
        "List": [
          {
            "Target": "default:MyData",
            "Priv": 7
          }
        ]
      },
      "~child": {
        "#operator": "Sequence",
        "#stats": {
          "#phaseSwitches": 1
        },
        "~children": [
          {
            "#operator": "Sequence",
            "#stats": {
              "#phaseSwitches": 1
            },
            "~children": [
              {
                "#operator": "DistinctScan",
                "#stats": {
                  "#itemsIn": 385,
                  "#itemsOut": 384,
                  "#phaseSwitches": 1543,
                  "kernTime": "3.7741828s"
                },
                "scan": {
                  "#operator": "IndexScan3",
                  "#stats": {
                    "#itemsOut": 385,
                    "#phaseSwitches": 1545,
                    "execTime": "1.9312ms",
                    "servTime": "3.7722516s"
                  },
                  "as": "d",
                  "index": "ix1",
...

@Poltar ,

With keyset pagination you transform the OFFSET into a start value (i.e., the last values from your previous query). The link above explains everything. If your data is not big, you can stay with OFFSET (in that case remove META().id from the index, the query predicate, and the ORDER BY).

SELECT t.*
FROM MyData AS t USE KEYS ( SELECT RAW d.k
                            FROM (SELECT META().id AS k, subjectId, id
                                  FROM MyData
                                  WHERE ANY cat IN category SATISFIES cat IN [1,3,4,5,6,7,8,18] END
                                        AND someTs is not missing
                                        AND META().id is not missing
                                  ORDER BY someTs DESC, META().id DESC) AS d
                            WHERE d.subjectId NOT IN [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
                            AND d.id NOT IN [100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600]
                            LIMIT 20 );

NOT IN has to walk through all the elements (it switches to a hash table once there are >= 16 values, i.e. a single comparison after a one-time hash-table build). If you have more than 4 values, try padding the list to 16 with duplicates or values that can never match, such as strings: [1,2,3,4,5,6,7,8,9,10,"1","2","3","4","5","6"].
I pushed NOT IN into the parent query so that the index scan doesn’t spike and NOT IN isn’t transformed into a range.
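To see why string padding is safe when the field is numeric, here is a quick Python sketch (the >= 16 hash-table threshold is engine-internal and not observable here; this only shows that a number never equals a string, so padding cannot change the excluded set):

```python
# Padding a NOT IN list with values that can never match (strings, when the
# field is numeric) changes the list size but not the excluded set.
excluded = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
padded = excluded + ["1", "2", "3", "4", "5", "6"]   # now 16 entries

ids = [1, 5, 11, 200]
keep_plain  = [i for i in ids if i not in excluded]  # 1 and 5 filtered out
keep_padded = [i for i in ids if i not in padded]    # same rows survive

assert keep_plain == keep_padded == [11, 200]
```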

Thank you, this query performs great, but I can’t get keyset pagination to work.
The “keyset pagination” example you sent above uses a UUID as meta().id.
But my documents are named “crd::$id” (e.g. crd::123456); the order of the ids is not the order of the documents, because what matters is the time they were added. In the rare cases where someTs values match (several documents added within the same second), their relative order is unimportant but must stay constant so that it doesn’t break the pagination.
That means data can be added to the database in the order below, and when querying I have to preserve the document order based on someTs (reversed):

1. {"someTs": 1642454000, "id": 123456, "subjectId": 654321, "category": [1]} //crd::123456 as meta().id
2. {"someTs": 1642454001, "id": 2, "subjectId": 654321, "category": [1]}
3. {"someTs": 1642454002, "id": 58, "subjectId": 654321, "category": [1]}
4. {"someTs": 1642454003, "id": 4, "subjectId": 654321, "category": [1]}
5. {"someTs": 1642454003, "id": 987, "subjectId": 654321, "category": [1]}
...

Unfortunately, with each subsequent query I now get documents with a smaller id than the last document of the previous query, which is wrong. I added 480 documents to the database for a test (the document count would be 10–20 thousand in real use), and I only got 3 pages of 20 documents each.

Should I switch to using a UUID (although it’s not clear to me how Couchbase would determine the time from a string like 038c8a13-e1e7-4848-80ec-8819ff923602), or am I doing something wrong?

Thank you again for your help and I apologize for taking up your time :slight_smile:

META().id is unique in Couchbase, so it should work in your case. Because you need to ORDER BY someTs, you have to use someTs and META().id together to break ties.

If documents are constantly mutating, you may miss values relative to the previous page, but the same can happen with LIMIT/OFFSET (you can miss rows or get duplicates). The data is not MVCC.
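The LIMIT/OFFSET duplicate problem is easy to demonstrate with a toy list (hypothetical document names; any insert at the top of the timeline between page fetches shifts the offset window):

```python
# Why OFFSET paging can return duplicates under concurrent writes: a new
# document arriving at the top shifts every later page down by one.
feed = [f"doc{i}" for i in range(10)]   # newest first

page1 = feed[0:5]                       # OFFSET 0 LIMIT 5 -> doc0..doc4
feed.insert(0, "docNEW")                # a new document lands at the top
page2 = feed[5:10]                      # OFFSET 5 LIMIT 5 -> doc4..doc8

assert "doc4" in page1 and "doc4" in page2   # doc4 is returned twice
```

A symmetric deletion would instead make a row silently disappear between pages; keyset pagination avoids the shift, though neither approach gives a consistent snapshot.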


CREATE INDEX ix1 ON MyData(someTs DESC, META().id DESC, DISTINCT category, id, subjectId);

Start with $ts = "" and $key = ""; on each subsequent iteration, pass the last values from the previous page:

SELECT t.*
FROM MyData AS t USE KEYS ( SELECT RAW d.k
                            FROM (SELECT META().id AS k, subjectId, id
                                  FROM MyData
                                  WHERE ANY cat IN category SATISFIES cat IN [1,3,4,5,6,7,8,18] END
                                        AND someTs <= $ts
                                        AND META().id < $key
                                  ORDER BY someTs DESC, META().id DESC) AS d
                            WHERE d.subjectId NOT IN [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
                            AND d.id NOT IN [100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600]
                            LIMIT 20 );

Run the same query, setting $ts and $key to the last values from the previous execution.

I am performing the test in a local DB where no data changes during the test.
I should probably provide an example to clarify my question. I added a “name” field to each document to demonstrate the problem (and ignored all the exclusions, since they are not related to the pagination question).
Given this request (on the primary index; performance is ignored for the purposes of the example):
SELECT name, id, someTs FROM MyData ORDER BY someTs DESC LIMIT 14

[
  {
    "someTs": 1642369255567,
    "id": 3909231966,
    "name": "1"
  },
  {
    "someTs": 1642369255504,
    "id": 3739506717,
    "name": "2"
  },
  {
    "someTs": 1642369255442,
    "id": 3870235232,
    "name": "3"
  },
  {
    "someTs": 1642369255379,
    "id": 4509072012,
    "name": "4"
  },
  {
    "someTs": 1642369255315,
    "id": 2616444601,
    "name": "5"
  },
  {
    "someTs": 1642369255251,
    "id": 1890691602,
    "name": "6"
  },
  {
    "someTs": 1642369255190,
    "id": 1021872887,
    "name": "7"
  },
  {
    "someTs": 1642369255128,
    "id": 1827966129,
    "name": "8"
  },
  {
    "someTs": 1642369255066,
    "id": 3307694343,
    "name": "9"
  },
  {
    "someTs": 1642369255003,
    "id": 4399315647,
    "name": "10"
  },
  {
    "someTs": 1642369254940,
    "id": 1662388621,
    "name": "11"
  },
  {
    "someTs": 1642369254879,
    "id": 767925967,
    "name": "12"
  },
  {
    "someTs": 1642369254817,
    "id": 863714771,
    "name": "13"
  },
  {
    "someTs": 1642369254755,
    "id": 2452014389,
    "name": "14"
  }
]

(Name 1-14)

Query for the first page (page size = 5 to keep the output compact):

SELECT name, id, someTs
FROM MyData USE KEYS ( SELECT RAW d.k
                       FROM (SELECT META().id AS k, subjectId, id
                             FROM MyData
                             WHERE someTs IS NOT MISSING AND META().id IS NOT MISSING AND category IS NOT MISSING
                             ORDER BY someTs DESC, META().id DESC) AS d
                       LIMIT 5 )
[
  {
    "someTs": 1642369255567,
    "id": 3909231966,
    "name": "1"
  },
  {
    "someTs": 1642369255504,
    "id": 3739506717,
    "name": "2"
  },
  {
    "someTs": 1642369255442,
    "id": 3870235232,
    "name": "3"
  },
  {
    "someTs": 1642369255379,
    "id": 4509072012,
    "name": "4"
  },
  {
    "someTs": 1642369255315,
    "id": 2616444601,
    "name": "5"
  }
]

Names 1–5, no problem. So for the next page I will use someTs <= 1642369255315 and META().id < “crd::2616444601” from the last element, as I understand it:
SELECT name, id, someTs
FROM MyData USE KEYS ( SELECT RAW d.k
                       FROM (SELECT META().id AS k, subjectId, id
                             FROM MyData
                             WHERE someTs <= 1642369255315 AND META().id < "crd::2616444601" AND category IS NOT MISSING
                             ORDER BY someTs DESC, META().id DESC) AS d
                       LIMIT 5 )

[
  {
    "someTs": 1642369255251,
    "id": 1890691602,
    "name": "6"
  },
  {
    "someTs": 1642369255190,
    "id": 1021872887,
    "name": "7"
  },
  {
    "someTs": 1642369255128,
    "id": 1827966129,
    "name": "8"
  },
  {
    "someTs": 1642369254940,
    "id": 1662388621,
    "name": "11"
  },
  {
    "someTs": 1642369254755,
    "id": 2452014389,
    "name": "14"
  }
]

Instead of names 6–10, I only get documents with a key less than “crd::2616444601”, which is wrong (please pay attention to the “name” field), apparently because of the condition meta().id < “crd::2616444601”.
The id, name, and someTs fields are constant for these documents and have not been edited.
Can I ask you to explain why this is happening?

Hi @Poltar ,

Try this

SELECT t.*
FROM MyData AS t USE KEYS ( SELECT RAW d.k
                            FROM (SELECT META().id AS k, subjectId, id, someTs
                                  FROM MyData
                                  WHERE ANY cat IN category SATISFIES cat IN [1,3,4,5,6,7,8,18] END
                                        AND someTs <= $ts
                                  ORDER BY someTs DESC, META().id DESC) AS d
                            WHERE d.subjectId NOT IN [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
                            AND d.id NOT IN [100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600]
                            AND (d.someTs < $ts OR d.k < $key)
                            LIMIT 20 );

i.e.

SELECT name, id, someTs
FROM MyData USE KEYS ( SELECT RAW d.k
                       FROM (SELECT META().id AS k, subjectId, id, someTs
                             FROM MyData
                             WHERE someTs <= 1642369255315 AND category IS NOT MISSING
                             ORDER BY someTs DESC, META().id DESC) AS d
                       WHERE (d.someTs < 1642369255315 OR d.k < "crd::2616444601")
                       LIMIT 5 )

In the index, data is sorted on someTs; where there are duplicate someTs values, it is then sorted on the next key (META().id).

Keyset pagination needs the second key to be unique. To handle LIMIT cutting off in the middle of a run of equal someTs values, so that the next page resumes where the previous one left off, we first apply the non-strict condition (someTs <= $ts) in the inner query, and then in the parent query compare the unique key only when someTs equals the cursor value. (If the OR condition were placed in the inner query, it would be pushed down to the indexer, or index selection might not happen because of the OR.)
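The difference between the two predicates can be checked directly against the sample documents from this thread (names, timestamps, and keys copied from the output above; plain Python list filters stand in for the index scan):

```python
# (name, someTs, document key), already in someTs DESC order, from the thread.
rows = [
    ("1",  1642369255567, "crd::3909231966"),
    ("2",  1642369255504, "crd::3739506717"),
    ("3",  1642369255442, "crd::3870235232"),
    ("4",  1642369255379, "crd::4509072012"),
    ("5",  1642369255315, "crd::2616444601"),
    ("6",  1642369255251, "crd::1890691602"),
    ("7",  1642369255190, "crd::1021872887"),
    ("8",  1642369255128, "crd::1827966129"),
    ("9",  1642369255066, "crd::3307694343"),
    ("10", 1642369255003, "crd::4399315647"),
    ("11", 1642369254940, "crd::1662388621"),
    ("12", 1642369254879, "crd::767925967"),
    ("13", 1642369254817, "crd::863714771"),
    ("14", 1642369254755, "crd::2452014389"),
]
ts, key = 1642369255315, "crd::2616444601"   # cursor = last row of page 1

# Buggy: the key is compared on every row, so rows with a lexically larger
# key (e.g. names 9, 10, 12, 13) vanish even though their someTs is older.
buggy = [n for n, t, k in rows if t <= ts and k < key][:5]
# Fixed: the key only breaks the tie when someTs equals the cursor value.
fixed = [n for n, t, k in rows if t <= ts and (t < ts or k < key)][:5]

assert buggy == ["6", "7", "8", "11", "14"]  # the wrong page seen above
assert fixed == ["6", "7", "8", "9", "10"]   # the expected page 2
```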