Is there a good way to query for the presence of a sub-field? This seems to be very slow

I have documents in bucket MY_BUCKET of the form:

"_class": "class.name",
"docType": "MY_TYPE",
"primaryObject": {
  "key1": {
    "subA1": {
      "subB1": 123
    },
    "subA1": {
      "subB1": 234
    }
  }, 
  "key2": {
    "subA1": {
      "subB1": 456
    },
    "subA1": {
      "subB1": 789
    }
  }, 
  "key3": {
    "subA1": {
      "subB1": 222
    },
    "subA1": {
      "subB1": 333
    }
  },
  ...
  "keyN": {
    "subA1": {
      "subB1": 222
    },
    "subA1": {
      "subB1": 333
    }
  }
}

Some of these are missing “key1”. I want to query for all documents which are not missing “key1”, and then return the document in the original structure but with “key2”…“keyN” omitted. It needs to go into a Java object in the end, so specifically I want to return this:

[
  "id": "ABC::123::XYZ"
  "docType": "MY_TYPE",
  "primaryObject": {
    "key1": {
      "subA1": {
        "subB1": 123
      },
      "subA1": {
        "subB1": 234
      }
    }
  },
...
(multiple such results)
]

I am trying to find a good way to query this. The following seems to be the obvious choice:

SELECT meta().id as id, docType, {"key1": primaryObject.`key1`} as primaryObject 
FROM BUCKET_NAME
WHERE docType = "MY_TYPE" AND primaryObject.`key1` IS NOT MISSING
LIMIT [some integer]

However this runs very slowly. Even for a limit of 10, sometimes it just hits my 75 second timeout and fails. Is there a better way to run this query? As far as indices, currently I have an index on docType and nothing else.

EDIT: “key1” could be one of ~200 different values, and I need to have the ability to query using any one of them.

CREATE INDEX ix1 ON  BUCKET_NAME(primaryObject.`key1`) WHERE  docType = "MY_TYPE";
SELECT meta().id as id, docType, {"key1": primaryObject.`key1`} as primaryObject 
FROM BUCKET_NAME USE INDEX (ix1)
WHERE docType = "MY_TYPE" AND primaryObject.`key1` IS NOT MISSING
LIMIT [some integer]

https://index-advisor.couchbase.com/indexadvisor/#1

Great thank you. As a followup, is there a way to make this variable? I.e. I need to be able to run this arbitrarily on “key2” or “keyN” as well. I believe that I could make the 200 or so indices needed to keep the approach you’ve given, but I think my team will not be happy about it.

In that case use following. It must fetch those documents.

CREATE INDEX ix2 ON  BUCKET_NAME(DISTINCT OBJECT_NAMES(primaryObject) ) WHERE  docType = "MY_TYPE";

Set query parameters $k to the "keyN" or replace with value

SELECT meta().id as id, docType, {$k: primaryObject.[$k]} as primaryObject 
FROM BUCKET_NAME 
WHERE docType = "MY_TYPE" AND  ANY v IN OBJECT_NAMES(primaryObject) SATISIFIES v = $k END
LIMIT [some integer]

Thanks a lot, that gets me under the timeout in all relevant cases. It takes 1 minute to get 500 of 40000 total results from 5 million docs, and for a small number I can get results in 10-20 seconds. I’m wondering now though if it’s worth modifying the document structure somehow. Is there anything I could do on that side? The keys are arbitrary strings which I cannot modify. Otherwise maybe I will cache the results for each key.

CREATE INDEX ix2 ON  BUCKET_NAME(ALL OBJECT_NAMES(primaryObject) ) WHERE  docType = "MY_TYPE";

Set query parameters $k to the "keyN" or replace with value

SELECT RAW meta(b).id
FROM BUCKET_NAME  AS b
UNNEST OBJECT_NAMES(b) AS ok
WHERE b.docType = "MY_TYPE" AND   ok =$k
LIMIT [some integer]

Use above query get the document keys (make sure it covered), then use SDK subdoc API get the relevant part of the document reactively.

Why is it that if I add another AND condition to my WHERE clause (AND meta().id LIKE [pattern]), suddenly the time skyrockets again? Each query on its own runs quickly.

META().id predicate doesn’t pushed to indexer as there is no index key. It will be post fetch predicate.

Ok thanks I’ll read the source you gave. In the end will I need to drop the “meta().id LIKE” if I want to keep the performance up, or is it just a matter of adding another index? By the way, the new index you recommended is a dramatic performance improvement over the first one, so thanks again for that.

try this as long as wild card is at the end. i.e u have constant prefix

CREATE INDEX ix2 ON BUCKET_NAME(ALL OBJECT_NAMES(primaryObject) ,meta().id) WHERE docType = "MY_TYPE";

I need to do both :grimacing: We should probably go back and just add some data fields but right now I’m stuck searching the document Id’s, which are like “ATTRIBUTE1::ATTRIBUTE2”.

If you need both then constant prefix performs, others you have same issue.

Other option is key is fixed format use SPLIT(META().id,"::")[1] on both index and query

I dont know what the performance will be like, but have you tried using the IS/IS NOT MISSING comparison operator?

select meta().id from BUCKET_NAME where key1 IS NOT MISSING and docType = ‘MY_TYPE’