Need Advise On Group By Queries

We have a requirement involving a data set of 70 million records. The dev team aims to display the count of a specific document type in the user interface using a group by query, along with other field values from this extensive data set. However, creating a covering index for this purpose would be resource-intensive. Is there an alternative method to maintain the count of these particular document types, allowing us to efficiently retrieve the count from the 70 million documents. Below is sample query and similarly there are around 10-12 queries.

select network, zone, regnw,count(1) from CRM where __t =‘SR’ group by network, zone, regnw ;

Thanks,
Debasis

Please show the index you are creating.
Indexes are a trade-off between creating (and maintaining) the index vs querying. Nothing is free.

Why not have a dedicated collection for ‘SR’ documents?

Also - this might be a case for Analytics.

It has to scan and count no other way

1 Like

Does a collection maintain a count of the number of documents in the collection such that it could be used without a scan?

1 Like

Collection has but can’t be done like each type, each group

@mreiche we are creating indexes as below.

CREATE INDEX adv_idx ON CRM(zone,regnw) WHERE (__t` = “SR”);

Thanks,
Debasis

See my comment above and vsr1’s last comment. Put documents of different types (‘__t’) in separate collections. The count of documents in a collection is accessible without a scan. Otherwise - it will require a scan.

You are counting based on group, keeping separate collection also will not help

SELECT network, zone, regnw, count(1)
FROM CRM AS c
WHERE __t ="SR"
GROUP BY network, zone, regnw ;

CREATE INDEX i1 ON CRN(__t, network, zone, regnw);

Use above index. 70M entries take time. Depends on how many types and how many groups and how often changes you can generate new rows in different collection consolidated results and get those with out aggregates or cache results.

UPSERT INTO newcol1 (KEY k, VALUE v )
SELECT CONCAT2("_", __t, TOSTR(network), TOSTR(zone), TOSTR(regnw)) AS k.
       {__t, network, zone, regnw, "cnt":count(1)} AS v
FROM CRM AS c
WHERE __t IS NOT NULL
GROUP BY __t, network, zone, regnw ;
1 Like

Thanks, @vsr1 & @mreiche for your thoughts.

@vsr1, The approach that you suggested seems like a shift-left of the GROUP BY, meaning that instead of running the aggregation when counts are needed, we do it at the time of mutation. Is that correct?

However, since mutations occur at a rate of hundreds or thousands per second, this approach may not be feasible.

Additionally, since the mutation happens via KV access, the application would require two network round trips—one for the KV-based mutation and another for the UPSERT.

We had considered a similar approach where the mutation happens via KV, and we maintain a separate document, incrementing a count by 1 each time using another KV mutation. This eliminates the need for an UPSERT, but it still suffers from the same challenge—two network round trips!