Help for optimizing

@vsr1 runing SELECT DISTINCT userId FROM default WHERE clientId=$clientId AND type='User'; returns the following error
Index scan timed out - cause: Index scan timed out

Post explain. How many entries you have matching clientId. you may have to increase indexer timeout

About 5.5 millions

{
  "plan": {
    "#operator": "Sequence",
    "~children": [
      {
        "#operator": "IndexScan3",
        "covers": [
          "cover ((`default`.`clientId`))",
          "cover ((`default`.`userId`))",
          "cover ((meta(`default`).`id`))"
        ],
        "distinct": true,
        "filter_covers": {
          "cover ((`default`.`type`))": "User"
        },
        "index": "ix1",
        "index_id": "26b5b6c5f9954389",
        "index_projection": {
          "entry_keys": [
            0,
            1
          ]
        },
        "keyspace": "default",
        "namespace": "default",
        "spans": [
          {
            "exact": true,
            "range": [
              {
                "high": "\"ba2e8f22-1b7b-4407-82fa-445400fb4afd\"",
                "inclusion": 3,
                "low": "\"ba2e8f22-1b7b-4407-82fa-445400fb4afd\""
              }
            ]
          }
        ],
        "using": "gsi"
      },
      {
        "#operator": "Parallel",
        "~child": {
          "#operator": "Sequence",
          "~children": [
            {
              "#operator": "Filter",
              "condition": "((cover ((`default`.`clientId`)) = \"ba2e8f22-1b7b-4407-82fa-445400fb4afd\") and (cover ((`default`.`type`)) = \"User\"))"
            },
            {
              "#operator": "InitialProject",
              "distinct": true,
              "result_terms": [
                {
                  "expr": "cover ((`default`.`userId`))"
                }
              ]
            },
            {
              "#operator": "Distinct"
            },
            {
              "#operator": "FinalProject"
            }
          ]
        }
      },
      {
        "#operator": "Distinct"
      }
    ]
  },
  "text": "SELECT DISTINCT userId FROM default WHERE clientId='ba2e8f22-1b7b-4407-82fa-445400fb4afd' AND type='User'"
}

Plan is optimized. Given clientId 5.5 million qualifies? Then you have to increase the Indexer time out.

@vsr1 , is there any tricks to get result faster? I need to send notifications less than a minutes

You can use JavaSDK aynchrouns operation see if that helps. Once you get first userId you can use second query and return results.

If you can handle DISTINCT in application, you can use OFFSET and LIMIT

OR use keyset pagination

CREATE INDEX ix1 ON default( clientId, userId ) WHERE type=‘User’;
SELECT DISTINCT userId FROM default WHERE clientId=$clientId AND type=‘User’ AND userId > $userId ORDER BY userId LIMIT 1000;
start userId with “” and repeat query until now more rows with last userId as input to next one As described here https://blog.couchbase.com/offset-keyset-pagination-n1ql-query-couchbase/

@vsr1 I am using Streaming rows in nodejs client https://docs.couchbase.com/nodejs-sdk/2.6/n1ql-queries-with-sdk.html#streaming-rows

I have error event with code 1080

1080 Timeout [setting] exceeded Timeout was exceeded.

@brett19 How can I increate timeout in nodejs?

Using userId > $userId , must I use ORDER BY userId to guarantee that I iterate all users?

you can use ORDER BY userId if it uses index i mentioned it will be no-op.

@vsr1 What happens if I dont use ORDER BY , I think may be I get a userId mode than one time and never get other userId

User ORDER BY and let query decide and use index order.

How can I increate timeout in nodejs?

@vsr1 How you create index for the following query (Please explain why , I have question about ORDER BY and !=)

SELECT a,b FROM default 
WHERE 
  type='MyType' AND
  c=$c AND
  d!=$d AND
  e IN [$e1,$e2]
ORDER BY
  z

@vsr1 Is it possible to cover UPDATE too?

UPDATE default SET a='A' WHERE b='B'

UPDATE need to update whole document, not partial update. So it needs Fetch the document and can’t be covered.

d != $d is DNF transformed d < $d || d > $d. Follow the rules described here https://blog.couchbase.com/create-right-index-get-right-performance/.

1 Like

UPDATE need to update whole document, not partial update. So it needs Fetch the document and can’t be covered.

Is it check CAS during update? to ensure consistency

What about ORBER BY?

for the following query

SELECT a,b FROM default 
WHERE 
  type='MyType' AND
  c=$c AND
  d!=$d AND
  e IN [$e1,$e2]
ORDER BY
  z

I think the best index is

CREATE INDEX idx1 bucket(c, e, d, z, a ,b) WHERE type='MyType'

I think , I must put z before a,b , is it right? or no difference?

UPDATE CAS check done internally. If it fails it returns error and application needs to retry.
ORDER BY must follow index key order. It is okay to skipping single equality key.

CREATE INDEX idx1 bucket(c, z, e, d, a ,b) WHERE type=‘MyType’

You can always create index on empty bucket and do EXPLAIN and checkout.

1 Like

I really don’t understand why

CREATE INDEX idx1 bucket(c, z, e, d, a ,b) WHERE type=‘MyType’

Is better than

CREATE INDEX idx1 bucket(c, e, d, z, a ,b) WHERE type='MyType'

Explain shows that first it index scan, then filter, then sort data, So how it can filter data in c field but what about e? You put z after c , how it can efficiently filter e?

it is not always possible to create efficient index without Cost Based optimization.
You need to findout which one better based on your data.

CREATE INDEX idx1 bucket(c, z, e, d, a ,b) WHERE type=‘MyType’
CREATE INDEX idx2 bucket(c, e, d, z, a ,b) WHERE type='MyType'

SELECT a,b FROM default 
WHERE 
  type='MyType' AND
  c=$c AND
  d!=$d AND
  e IN [$e1,$e2]
ORDER BY
  z;

Which index is better for depends on Filter optimization or whole query optimization. How many items
qualify etc…
ix1 is better per filter optimization. If the query produces 1M items sorting may expensive.
In that situation if adding z NOT NULL predicate ix2 optimal.

In rule based optimization, You need to try different indexes and pick right one. If required provide USE INDEX hint. Cost based optimization will take care of this internally.

1 Like

I really cannot understand!!
Assume this following scenario

  1. We have an UPDATE query that matches 1 million documents
  2. We are in production and have high throughput and mutations
  3. We execute UPDATE query , to be completed it takes 15 minutes
  4. As I understand from EXPLAIN , couchbase first get all documents , then mutate it , So if at fetching a document with key A has CAS1 , at 5th minutes we change it from Data service , so it have now CAS2 , at 10th minutes, query want to mutate it with CAS1 , Oh! failed and query returns error
  5. I must retry query , and we have go to step3 , So it is likely to failed again and again?!

On other scenario may be , couchbase fetches only doc keys and when it want update document it fetches document then mutate it immediately and fetches next document , it is better scenario than the first, but still in high mutation scenario and millions matching document , it’s not possible to complete query

What is the solution?