Need Advise On Group By Queries

Debasis_Mallick · March 2, 2025, 12:21pm

We have a requirement involving a data set of 70 million records. The dev team aims to display the count of a specific document type in the user interface using a group by query, along with other field values from this extensive data set. However, creating a covering index for this purpose would be resource-intensive. Is there an alternative method to maintain the count of these particular document types, allowing us to efficiently retrieve the count from the 70 million documents. Below is sample query and similarly there are around 10-12 queries.

select network, zone, regnw,count(1) from CRM where __t =‘SR’ group by network, zone, regnw ;

Thanks,
Debasis

mreiche · March 2, 2025, 4:02pm

Please show the index you are creating.
Indexes are a trade-off between creating (and maintaining) the index vs querying. Nothing is free.

Why not have a dedicated collection for ‘SR’ documents?

Also - this might be a case for Analytics.

vsr1 · March 3, 2025, 4:30pm

It has to scan and count no other way

mreiche · March 3, 2025, 5:28pm

Does a collection maintain a count of the number of documents in the collection such that it could be used without a scan?

vsr1 · March 3, 2025, 5:35pm

Collection has but can’t be done like each type, each group

Debasis_Mallick · March 4, 2025, 5:08am

@mreiche we are creating indexes as below.

CREATE INDEX adv_idx ON CRM(zone,regnw) WHERE (__t` = “SR”);

Thanks,
Debasis

mreiche · March 4, 2025, 5:08pm

See my comment above and vsr1’s last comment. Put documents of different types (‘__t’) in separate collections. The count of documents in a collection is accessible without a scan. Otherwise - it will require a scan.

vsr1 · March 4, 2025, 5:28pm

You are counting based on group, keeping separate collection also will not help

SELECT network, zone, regnw, count(1)
FROM CRM AS c
WHERE __t ="SR"
GROUP BY network, zone, regnw ;

CREATE INDEX i1 ON CRN(__t, network, zone, regnw);

Use above index. 70M entries take time. Depends on how many types and how many groups and how often changes you can generate new rows in different collection consolidated results and get those with out aggregates or cache results.

UPSERT INTO newcol1 (KEY k, VALUE v )
SELECT CONCAT2("_", __t, TOSTR(network), TOSTR(zone), TOSTR(regnw)) AS k.
       {__t, network, zone, regnw, "cnt":count(1)} AS v
FROM CRM AS c
WHERE __t IS NOT NULL
GROUP BY __t, network, zone, regnw ;

pccb · March 6, 2025, 8:08am

Thanks, @vsr1 & @mreiche for your thoughts.

@vsr1, The approach that you suggested seems like a shift-left of the GROUP BY, meaning that instead of running the aggregation when counts are needed, we do it at the time of mutation. Is that correct?

However, since mutations occur at a rate of hundreds or thousands per second, this approach may not be feasible.

Additionally, since the mutation happens via KV access, the application would require two network round trips—one for the KV-based mutation and another for the UPSERT.

We had considered a similar approach where the mutation happens via KV, and we maintain a separate document, incrementing a count by 1 each time using another KV mutation. This eliminates the need for an UPSERT, but it still suffers from the same challenge—two network round trips!

Topic		Replies	Views
Group By Type slow Analytics	3	411	November 15, 2023
Array indexing in Couchbase server 4.5 Couchbase Server	1	1407	July 22, 2016
Couchbase Query Optimization in CB Couchbase Server n1ql	9	366	October 9, 2023
Learning Couchbase Indexing/Views (Practical) Couchbase Server	4	242	June 25, 2024
Couchbase Data modelling and Indexing Couchbase Server query , n1ql , index	1	659	July 29, 2020

Need Advise On Group By Queries

Related topics