Optimizing analytics query with optional filters

I’m developing an analytics service that provides search engine functionality for some time-series data stored in a couchbase-analytics node.

The data contains a number of fields, similar to

{
  "id": "5434fs",
  "customerId": "customer1",
  "firstName": "John",
  "lastName": "Doe",
  "type": "FOO",
  "timestamp": 1590587636,
  "history": [
    {
      "status": "dispatched",
      "tracking": "3432432423",
      "timestamp": 1590587636
    },
    {
      "status": "recieved",
      "timestamp": 1590507636
    }
  ]
}

The service allows users to apply filters on one or more fields, so the N1QL query looks like this

SELECT `o`.`id`,
       `o`.`firstName`,
       `o`.`lastName`,
       `o`.`type`,
       `o`.`timestamp`,
       `o`.`status`,
       `o`.`tracking`
FROM `orders` o
UNNEST `o`.`history` h
WHERE `o`.`customerId` = $`customerId`
  AND `h`.`status` = 'dispatched'
  AND (IS_NULL($type) OR `o`.`type` = $type)
  AND (IS_NULL($firstName) OR LOWER(`o`.`firstName`) = LOWER($firstName))
  AND (IS_NULL($lastName) OR LOWER(`o`.`lastName`) = LOWER($lastName))
ORDER BY `o`.`timestamp` DESC
LIMIT $`limit`
OFFSET $`offset`

To start off with we added an index on pr.customerId and pr.timestamp

CREATE INDEX idx_customerId_timestamp ON `orders`(
  `customerId`:STRING,
  `timestamp`:BIGINT
)

The performance of the query on our Couchbase Enterprise Edition 6.0.4 build 3082 node wasn’t good enough for what we want to achieve so we have been trying to optimize it with little success so far.

We tried creating an index that covered all the parameters

CREATE INDEX idx_all_params ON `orders`(
  `customerId`:STRING,
  `timestamp`:BIGINT,
  `type`:STRING,
  `firstName`:STRING,
  `lastName`: STRING
)

But it resulted in some odd behaviour where queries would not return results with a timestamp before the idx_all_params index was created. From researching indexes a bit more I’ve learned that the use of LOWER means that indexes can’t be used on those fields, so we plan to make all the data lower case. It also doesn’t appear to be possible to index the unnest, so again we plan to refactor the data to remove it.

Are we on the right track here? Are we best creating single indexes that cover all parameters, or should we make multiple smaller indexes that can be used in parrallel?

Analytics currently does not support functional indexes (e.g lower(field_name)) and indexes on nested fields (for UNNEST).

Would it be possible to modify the query your service is generating to avoid this pattern?
IS_NULL($parameter) OR field_name = $parameter
For example, if $type value is provided by user then it would only generate o.type = $type, but if $type value is not given by user then it won’t generate the filter on it at all.

If this modification is possible then you could create 4 individual indexes on customerId, field containing lower case of type, field containing lower case of firstName, and field containing lower case of lastName.
If generated query no longer contains IS_NULL() OR … patterns, then it should utilize those indexes.

Thanks for the advice Dimitry.

I have modified my code to remove the IS_NULL($parameter) checks. It now adds filters only when the user provides a value for them. This has resulted in more indexes being used :+1:

Since we always have customerId and timestamp in the predicate, should we add those fields into our indexes? Or are we better to just make indexes with single fields?

E.g.

CREATE INDEX idx_customerId_timestamp_type ON `orders`(
  `customerId`:STRING,
  `timestamp`:BIGINT,
  `type`:STRING,
)

or

CREATE INDEX idx_type ON `orders`(
  `type`:STRING,
)

Hi Lewis,

Try creating one index for all fields that you always have in the predicate of the query (customerId and timestamp as you mentioned):
CREATE INDEX idx_customerId_timestamp ON `orders`(`customerId`:STRING, `timestamp`:BIGINT)

and individual single field indexes for additional fields that user can optionally search on:
CREATE INDEX idx_type ON `orders`(`type`:STRING)
CREATE INDEX idx_firstName ON `orders`(`firstName`:STRING)
CREATE INDEX idx_lastName ON `orders`(`lastName`: STRING)

1 Like

Thanks again Dimitry, my queries are running significantly faster now :partying_face:

I’ve got one more question about the idx_type index if thats ok? Over 90% of our documents have one particular value for type (e.g. “FOO”), would that mean that when we apply a filter for type = “FOO” the query will do an intersect operation on 90% of the document IDs in the idx_type index? Just wondering what impact that would have on performance.

1 Like

Hi Lewis,

Well, it’s an intersect operation that would match in 90% of the cases, so there might be some performance impact. Although it only operates on document IDs so hopefully the impact won’t be significant. You could try running a query with and without the filter (type=“FOO”) to measure the performance difference.