Analytics query perfomance

Hello, fellow developers!

I’m currently working on a complex N1QL query and would appreciate your insights for potential optimizations or alternative approaches. The primary goal of this query is to return a dataset with an entity_count indicator, which signals the presence of at least one entity meeting the conditions outlined in the subquery. Before this query was using JOIN with GROU BY.

Here’s the current query structure I’m using:

SELECT {
    'en': col_doc.data.en,
    'lang': IFMISSING(col_doc.data.en, {}),
    'doc_id': col_doc.id,
    'orderBy': ARRAY_LENGTH(OBJECT_VALUES(col_doc._orderBy)) > 0,
    'entity_count': (
        SELECT VALUE col_entity.id 
        FROM dev.project.entity col_entity 
        WHERE ARRAY_CONTAINS(ARRAY_STAR(col_entity.data.en.categories).alias, col_doc.data.en.alias)
            AND col_entity._scope = 'general_scope'
            AND CASE 
                WHEN (ARRAY_LENGTH(col_entity.data.en.provider.allowed_countries) = 0) 
                THEN 'AL' NOT IN col_entity.data.en.provider.restricted_countries 
                ELSE 'AL' IN col_entity.data.en.provider.allowed_countries 
            END
            AND ('EUR' NOT IN col_entity.data.en.provider.restricted_currencies)
            AND ('AL_Elbasan' NOT IN col_entity.data.en.provider.restricted_regions)
            AND (col_entity.data.en.provider.provider_view IS MISSING
                OR col_entity.data.en.provider.provider_view = TRUE)
            AND col_entity.data.en.provider IS NOT MISSING 
        LIMIT 1
    )
} AS dt,
COALESCE(col_doc._orderBy.base) AS orders
FROM dev.project.categories_entity col_doc
WHERE (
    col_doc.data.en.category_view = TRUE
    AND (ARRAY_LENGTH(col_doc.data.en.allowed_countries) = 0
        OR ARRAY_CONTAINS(col_doc.data.en.allowed_countries, 'AL') = TRUE)
    AND (ARRAY_LENGTH(col_doc.data.en.restricted_countries) = 0
        OR ARRAY_CONTAINS(col_doc.data.en.restricted_countries, 'AL') = FALSE)
    AND col_doc._scope = "general_scope"
    AND col_doc.status = "active"
)
ORDER BY orders
LIMIT 25

Can anyone suggest alternative (more optimal) ways to structure this query, especially concerning how the entity_count subquery is handled? Are there performance considerations or improvements that I might be missing?

Any feedback, optimization tips, or ideas would be greatly appreciated!

Hi @Maxim,
Would you mind sharing the query that was using JOIN and GROUP BY before?

Also, it will be helpful if you can run the following count queries and report to us the cardinalities:

SELECT COUNT(DISTINCT col_doc.data.en.alias)
FROM dev.project.categories_entity col_doc;

SELECT COUNT(DISTINCT aliases)
FROM dev.project.entity col_entity
UNNEST ARRAY_STAR(col_entity.data.en.categories).alias AS aliases
WHERE col_entity._scope = 'general_scope'
  AND CASE WHEN (ARRAY_LENGTH(col_entity.data.en.provider.allowed_countries) = 0)
      THEN 'AL' NOT IN col_entity.data.en.provider.restricted_countries
      ELSE 'AL' IN col_entity.data.en.provider.allowed_countries
      END
  AND ('EUR' NOT IN col_entity.data.en.provider.restricted_currencies)
  AND ('AL_Elbasan' NOT IN col_entity.data.en.provider.restricted_regions)
  AND (col_entity.data.en.provider.provider_view IS MISSING
    OR col_entity.data.en.provider.provider_view = TRUE)
  AND col_entity.data.en.provider IS NOT MISSING;

You mentioned the following:

I see the subquery for the field entity_count returns a single id. Do you care if the value can be a boolean (true or false) to indicate the presence instead? Or do you really need to get the value of the id?

1 Like