We’re managing a bucket with 800M+ documents, which is approaching 1B documents.
The typical document has different fields, but the ones included 100% in the query are “BaseTopic” and “AcquisitionTimestamp”.
We used the index advisor, but we observed too much results with the different possible queries. We cannot afford a multiple index creation since we’re on an edge environment with low resources.
Currently we’re using an index like the following:
CREATE INDEX “idx1” ON bucket.scope.collection (AcquisitionTimestamp, BaseTopic) PARTITION BY HASH (BaseTopic).
BaseTopic is a string like “Environment/DeviceCategory/Device/Event”, while AcquisitionTimestamp it’s a string representing the exact timestamp when the event has been received.
Queries are usually filtered by BaseTopic and AcquisitionTimestamp in a frame of 1 day (e.g. AcquisitionTimestamp >= “2024-07-11T00:00:00” AND AcquistionTimestamp < “2024-07-11T23:59:99.999999”.
Would the index be more efficient partitioning also on the AcquisitionTimestamp?
Is the fields order impacting the index performance (e.g. partition by “AcquisitionTimestamp, BaseTopic” is different than “BaseTopic, AcquisitionTimestamp”)?
I’m not the index expert, but if you query on the ‘day’ portion of the timestamp, then maybe use the ‘day’ portion of the timestamp in the index and query on that (with substr)? The predicate can be for equality instead of two inequalities.
Hi @mreiche many thanks for your suggestion!
Since we cannot try new indexes at the moment, I would like to better understand the impact of adding the AcquisitionTimestamp in the partitioning, as it is right now.
As you have predicate on BaseTopic and it is most likely equality or IN. So move that leading key so that indexScans will be faster due to it needs to scan only subset of index. If AcquisitionTimestamp is leading key, the query predicate will be range and BaseTopic is spread across index and may need more disk reads. Rule#11 Right Index Option: How to Get the Right Performance
Not sure how many BaseTopic you have if reasonably small then partial index for each type might be better due to length of the BaseTopic may be 40 bytes now look index having 1B*40 bytes (40GB) bigger.
CREATE INDEX ix1 ON bucket.scope.collection (BaseTopic, AcquisitionTimestamp)
PARTITION BY HASH (BaseTopic)
For AcquisitionTimestamp is PARTITION , it must be immutable. Also most queries does range scan, So it can’t do partition elimination(must do scatter gather). Only advantage is if index doesn’t fit one node or partition entries are skewed (i.e. one partition has lot and others less).
Another option remodel bucket to collection and each collection store specific BaseTopic and avoid BaseTopic in document and index which can save more
As @mreiche mentioned if you only query one day you can use SUBSTR(AcquisitionTimestamp,0,10) in index and query or DATE_STR_FORMAT() .
Best option for not changing queries is create new index and drop old index afterwords you will see immediate impact.
I tried on 10M documents, but I’m seeing no performance improvements, even adding the BaseTopic as first index field.
I also tried to partition by “BaseTopic,AcquisitionTimestamp” but didn’t work.
Additionally, I tried to substitute each reference to AcquisitionTimestamp with SUBSTR(AcquisitionTimestamp,0,10) but didn’t get the wanted results.
I partitioned the index also with SUBSTR(AcquisitionTimestamp,0,10) and queried with SUBSTR(AcquisitionTimestamp,0,10) = “YYYY-MM-DD” but nothing improved.