How to create INDEX in Analytics service

How to create the same index in Analytics service and guaranteed to use it in search:

CREATE INDEX idx_user ON default(someId, status, active, -MILLIS(creationDate)) WHERE type=“User”;

SELECT users.*
FROM default users
WHERE users.type=“User”
AND users.someId=“someId”
AND users.status=“following”
AND users.active=true
ORDER BY -MILLIS(users.creationDate)
LIMIT 10

Hi @Maxim, Analytics does not support functional indexes (nor a WHERE clause in index definition).

If you are trying to create a (covering) index to run a query in the Analytics service, then you do not need to do that. In Analytics, you should be able to run any query without any index.

Typically, you first create a collection in Analytics on some bucket/collection from the Data Service. Optionally, you can include a WHERE clause to filter only certain documents that will be ingested in the Analytics collection.
For example, to create an Analytics collection on a Data Service bucket called default, then the syntax is:

CREATE ANALYTICS COLLECTION `col_name` ON `default`;

If you would like to create the Analytics collection and only include documents that have the field type = “User”:

CREATE ANALYTICS COLLECTION `col_name` ON `default` WHERE `type` = "User";

After creating the Analytics collection, you should be able to run your query without having to create any index. However, you can always create Analytics secondary indexes to speed up queries on selective fields. As an example for your query, you can create the following secondary index:

CREATE INDEX `idx_user` ON `col_name`(`someId`: STRING, `status`: STRING, `type`: STRING);

Any query that has predicates on all or a prefix of the indexed fields should be able to utilize this index. You can try different index definitions (different fields and different prefixes) and see which one is better.

1 Like

Thank you for the quick response. However, I have some questions regarding which data types are supported by indexes in the Analytics service (the information about data types is incomplete in the documentation found here). Additionally, I would like to know under what conditions the created indexes will be applied during searches, and in which cases they will not. This question arose as a result of executing an EXPLAIN query, which showed that the created indexes are not being applied during the search.

Moreover, I need advice on optimizing queries that sort by the updated_at field (date string). I would appreciate any advice on these questions and examples of best practices in the Analytics service.

Hi @Maxim,
You can find the supported types for indexes here in the Index Specification where it highlights the supported types:

BIGINT (or INT), DOUBLE, STRING.
DATE, TIME, DATETIME (aimed for indexes to be used by views having datetime types).

You can find the information about using indexes here where it talks about when an index will be used. If you see that a query is not using an index when it should, please share with us the query and the index definition and we will help you.

Moreover, I need advice on optimizing queries that sort by the updated_at field (date string). I would appreciate any advice on these questions and examples of best practices in the Analytics service.

Could you please elaborate more on what exactly you are looking for? Could you share an example?

Hi @ali.alsuliman !
I’m working with a Couchbase query that selects data from a scope containing more than 130,000 records. I’d need to explore ways to optimize it for better performance, especially since the dataset continues to grow.

SELECT data_fields.*, doc_id
FROM (
  SELECT VALUE {'en': col_doc.data.en, 'lang': IFMISSING(col_doc.data.`fi`, col_doc.data.en), 'doc_id': col_doc.id, 'orderBy': ARRAY_LENGTH(OBJECT_VALUES(col_doc._orderBy)) > 0}
  FROM prod.project.service col_doc
    JOIN prod.project.providers_service col_doc_pr ON col_doc_pr.data.en.alias = col_doc.data.en.provider.alias
    AND col_doc_pr.data.en.provider_view = TRUE
    AND col_doc_pr._scope = 'delivery'
    AND (ARRAY_LENGTH(col_doc_pr.data.en.allowed_countries) = 0
      OR ARRAY_CONTAINS(col_doc_pr.data.en.allowed_countries, "DE"))
    AND (ARRAY_LENGTH(col_doc_pr.data.en.restricted_countries) = 0
      OR ARRAY_CONTAINS(col_doc_pr.data.en.restricted_countries, "DE") = FALSE)
    AND (ARRAY_LENGTH(col_doc_pr.data.en.restricted_regions) = 0
      OR ARRAY_CONTAINS(col_doc_pr.data.en.restricted_regions, "") = FALSE)
    AND (IFMISSING(ARRAY_LENGTH(col_doc_pr.data.en.restricted_currencies), 0) = 0
      OR ARRAY_CONTAINS(col_doc_pr.data.en.restricted_currencies, "EUR") = FALSE)
    JOIN prod.project.categories_service col_doc_cat ON col_doc_cat.data.en.alias IN ["popular"]
    AND col_doc_cat.data.en.category_view = TRUE
    AND col_doc_cat._scope = 'delivery'
    AND (ARRAY_LENGTH(col_doc_cat.data.en.allowed_countries) = 0
      OR ARRAY_CONTAINS(col_doc_cat.data.en.allowed_countries, "DE"))
    AND (ARRAY_LENGTH(col_doc_cat.data.en.restricted_countries) = 0
      OR ARRAY_CONTAINS(col_doc_cat.data.en.restricted_countries, "DE") = FALSE)
    AND (ARRAY_LENGTH(col_doc_cat.data.en.restricted_regions) = 0
      OR ARRAY_CONTAINS(col_doc_cat.data.en.restricted_regions, "") = FALSE)
  WHERE (ARRAY_LENGTH(ARRAY_INTERSECT(ARRAY_STAR(col_doc.data.en.categories).alias, ["popular"]))>0
      AND ARRAY_CONTAINS(col_doc.data.en.devices, 'mobile')=TRUE
      AND col_doc._scope="delivery")
  ORDER BY orderBy DESC,
           IFMISSING(col_doc._orderBy.`fi_mobile_promo`, IFMISSING(col_doc._orderBy.`mobile_promo`, IFMISSING(col_doc._orderBy.`fi_promo`, IFMISSING(col_doc._orderBy.`promo`, IFMISSING(col_doc._orderBy.`fi_mobile`, IFMISSING(col_doc._orderBy.`mobile`, IFMISSING(col_doc._orderBy.`fi`, col_doc._orderBy.base))))))) ASC,
           col_doc.updated_at DESC
  LIMIT 24) AS data_fields;

Thanks for your time!

Hello!

I would like to return to my question, as the problem is still relevant. Is there any chance to get more detailed information on how to optimize queries in Couchbase Analytics, especially when sorting by a date field (string type)? Specifically, I’m interested in how to ensure the index is used and improve performance with large data volumes.

I would also appreciate recommendations on creating efficient indexes for such cases. Thank you in advance!

Hi @Maxim,
Here are some key points to pay attention to and some guidance:

  • Secondary indexes are used for predicates in the condition of WHERE and JOIN clauses. They are not used by ORDER BY clause.
  • Before creating secondary indexes, take note of the sizes of each collection, and determine how long it takes to just do a full scan of the collection. This will give you a baseline:
SELECT COUNT(*) FROM prod.project.service;
SELECT COUNT(*) FROM prod.project.providers_service;
SELECT COUNT(*) FROM prod.project.categories_service;
  • Different join orders can result in different performance. For example, is your join order building with the smaller collection or the bigger collection? Knowing the collection sizes, you should choose a join order that builds on the smaller one.
  • Re-visit some of the logic in the query and try simplifying some of the function calls.
  • Always look at the plan of the query and check if there is any obvious inefficiencies that can be addressed. The textual plan could be simpler to understand than the JSON one. You can get the textual plan by prefixing your query with: EXPLAIN TEXT <your_query>;
  • Check any warning you might get in the response. Some times, it is indicative of some query issues that should be looked at.
  • Check the cluster/nodes’ specification (number of CPUs, memory, … etc) and the configuration of the Analytics Service.

Now, first it looks like your query does a “cross product join” due to the following join condition where you can see the join condition is not really a join condition between two collections. You might want to look and think about that:

JOIN prod.project.categories_service col_doc_cat ON col_doc_cat.data.en.alias IN ["popular"]

Second, you could try creating some secondary indexes to fetch only the documents needed by the query instead of doing a full scan. For this, you will have to determine if the secondary index you are planning to create is selective or not. Some possible ones are the following which may or may not be useful depending on their selectivity. Try each one independently:

CREATE INDEX idx_scope ON prod.project.service(_scope: string);
CREATE INDEX idx_scope ON prod.project.providers_service(_scope: string);
CREATE INDEX idx_scope ON prod.project.categories_service(_scope: string);
CREATE INDEX idx_alias ON prod.project.categories_service(data.en.alias: string);

Having said this, I would highly recommend you reach out to the Professional Services here at Couchbase to guide you with more details about the above.