Is there a good way to query for the presence of a sub-field? This seems to be very slow

jc · December 11, 2020, 5:23am

I have documents in bucket MY_BUCKET of the form:

"_class": "class.name",
"docType": "MY_TYPE",
"primaryObject": {
  "key1": {
    "subA1": {
      "subB1": 123
    },
    "subA1": {
      "subB1": 234
    }
  }, 
  "key2": {
    "subA1": {
      "subB1": 456
    },
    "subA1": {
      "subB1": 789
    }
  }, 
  "key3": {
    "subA1": {
      "subB1": 222
    },
    "subA1": {
      "subB1": 333
    }
  },
  ...
  "keyN": {
    "subA1": {
      "subB1": 222
    },
    "subA1": {
      "subB1": 333
    }
  }
}

Some of these are missing “key1”. I want to query for all documents which are not missing “key1”, and then return the document in the original structure but with “key2”…“keyN” omitted. It needs to go into a Java object in the end, so specifically I want to return this:

[
  "id": "ABC::123::XYZ"
  "docType": "MY_TYPE",
  "primaryObject": {
    "key1": {
      "subA1": {
        "subB1": 123
      },
      "subA1": {
        "subB1": 234
      }
    }
  },
...
(multiple such results)
]

I am trying to find a good way to query this. The following seems to be the obvious choice:

SELECT meta().id as id, docType, {"key1": primaryObject.`key1`} as primaryObject 
FROM BUCKET_NAME
WHERE docType = "MY_TYPE" AND primaryObject.`key1` IS NOT MISSING
LIMIT [some integer]

However this runs very slowly. Even for a limit of 10, sometimes it just hits my 75 second timeout and fails. Is there a better way to run this query? As far as indices, currently I have an index on docType and nothing else.

EDIT: “key1” could be one of ~200 different values, and I need to have the ability to query using any one of them.

vsr1 · December 11, 2020, 5:38am

CREATE INDEX ix1 ON  BUCKET_NAME(primaryObject.`key1`) WHERE  docType = "MY_TYPE";
SELECT meta().id as id, docType, {"key1": primaryObject.`key1`} as primaryObject 
FROM BUCKET_NAME USE INDEX (ix1)
WHERE docType = "MY_TYPE" AND primaryObject.`key1` IS NOT MISSING
LIMIT [some integer]

https://index-advisor.couchbase.com/indexadvisor/#1

jc · December 11, 2020, 5:44am

Great thank you. As a followup, is there a way to make this variable? I.e. I need to be able to run this arbitrarily on “key2” or “keyN” as well. I believe that I could make the 200 or so indices needed to keep the approach you’ve given, but I think my team will not be happy about it.

vsr1 · December 11, 2020, 5:49am

In that case use following. It must fetch those documents.

CREATE INDEX ix2 ON  BUCKET_NAME(DISTINCT OBJECT_NAMES(primaryObject) ) WHERE  docType = "MY_TYPE";

Set query parameters $k to the "keyN" or replace with value

SELECT meta().id as id, docType, {$k: primaryObject.[$k]} as primaryObject 
FROM BUCKET_NAME 
WHERE docType = "MY_TYPE" AND  ANY v IN OBJECT_NAMES(primaryObject) SATISIFIES v = $k END
LIMIT [some integer]

jc · December 11, 2020, 10:24am

Thanks a lot, that gets me under the timeout in all relevant cases. It takes 1 minute to get 500 of 40000 total results from 5 million docs, and for a small number I can get results in 10-20 seconds. I’m wondering now though if it’s worth modifying the document structure somehow. Is there anything I could do on that side? The keys are arbitrary strings which I cannot modify. Otherwise maybe I will cache the results for each key.

vsr1 · December 11, 2020, 2:04pm

CREATE INDEX ix2 ON  BUCKET_NAME(ALL OBJECT_NAMES(primaryObject) ) WHERE  docType = "MY_TYPE";

Set query parameters $k to the "keyN" or replace with value

SELECT RAW meta(b).id
FROM BUCKET_NAME  AS b
UNNEST OBJECT_NAMES(b) AS ok
WHERE b.docType = "MY_TYPE" AND   ok =$k
LIMIT [some integer]

Use above query get the document keys (make sure it covered), then use SDK subdoc API get the relevant part of the document reactively.

jc · December 11, 2020, 3:13pm

Why is it that if I add another AND condition to my WHERE clause (AND meta().id LIKE [pattern]), suddenly the time skyrockets again? Each query on its own runs quickly.

vsr1 · December 11, 2020, 3:32pm

META().id predicate doesn’t pushed to indexer as there is no index key. It will be post fetch predicate.

jc · December 11, 2020, 3:39pm

Ok thanks I’ll read the source you gave. In the end will I need to drop the “meta().id LIKE” if I want to keep the performance up, or is it just a matter of adding another index? By the way, the new index you recommended is a dramatic performance improvement over the first one, so thanks again for that.

vsr1 · December 11, 2020, 3:40pm

try this as long as wild card is at the end. i.e u have constant prefix

CREATE INDEX ix2 ON BUCKET_NAME(ALL OBJECT_NAMES(primaryObject) ,meta().id) WHERE docType = "MY_TYPE";

jc · December 11, 2020, 3:45pm

I need to do both We should probably go back and just add some data fields but right now I’m stuck searching the document Id’s, which are like “ATTRIBUTE1::ATTRIBUTE2”.

vsr1 · December 11, 2020, 3:50pm

If you need both then constant prefix performs, others you have same issue.

Other option is key is fixed format use SPLIT(META().id,"::")[1] on both index and query

justusweber · December 15, 2020, 10:58pm

I dont know what the performance will be like, but have you tried using the IS/IS NOT MISSING comparison operator?

select meta().id from BUCKET_NAME where key1 IS NOT MISSING and docType = ‘MY_TYPE’

Topic		Replies	Views
Query on sub-document SQL++	8	675	October 8, 2021
Querying for all fields of each documents (without the sub-field) SQL++	17	6050	February 1, 2017
Count query on subdocument SQL++	10	1986	July 8, 2019
Query for META().id FROM subquery results SQL++ n1ql	2	978	August 16, 2020
Indexing on Sub Document SQL++	5	2325	June 23, 2017

Is there a good way to query for the presence of a sub-field? This seems to be very slow

Related topics