Index suggestion

fabiosst · July 13, 2024, 11:58am

We’re managing a bucket with 800M+ documents, which is approaching 1B documents.

The typical document has different fields, but the ones included 100% in the query are “BaseTopic” and “AcquisitionTimestamp”.

We used the index advisor, but we observed too much results with the different possible queries. We cannot afford a multiple index creation since we’re on an edge environment with low resources.

Currently we’re using an index like the following:
CREATE INDEX “idx1” ON bucket.scope.collection (AcquisitionTimestamp, BaseTopic) PARTITION BY HASH (BaseTopic).

BaseTopic is a string like “Environment/DeviceCategory/Device/Event”, while AcquisitionTimestamp it’s a string representing the exact timestamp when the event has been received.

Queries are usually filtered by BaseTopic and AcquisitionTimestamp in a frame of 1 day (e.g. AcquisitionTimestamp >= “2024-07-11T00:00:00” AND AcquistionTimestamp < “2024-07-11T23:59:99.999999”.

Would the index be more efficient partitioning also on the AcquisitionTimestamp?
Is the fields order impacting the index performance (e.g. partition by “AcquisitionTimestamp, BaseTopic” is different than “BaseTopic, AcquisitionTimestamp”)?

mreiche · July 13, 2024, 2:06pm

I’m not the index expert, but if you query on the ‘day’ portion of the timestamp, then maybe use the ‘day’ portion of the timestamp in the index and query on that (with substr)? The predicate can be for equality instead of two inequalities.

fabiosst · July 15, 2024, 12:35pm

Hi @mreiche many thanks for your suggestion!
Since we cannot try new indexes at the moment, I would like to better understand the impact of adding the AcquisitionTimestamp in the partitioning, as it is right now.

vsr1 · July 15, 2024, 1:53pm

As you have predicate on BaseTopic and it is most likely equality or IN. So move that leading key so that indexScans will be faster due to it needs to scan only subset of index. If AcquisitionTimestamp is leading key, the query predicate will be range and BaseTopic is spread across index and may need more disk reads. Rule#11 Right Index Option: How to Get the Right Performance

Not sure how many BaseTopic you have if reasonably small then partial index for each type might be better due to length of the BaseTopic may be 40 bytes now look index having 1B*40 bytes (40GB) bigger.

CREATE INDEX ix1 ON bucket.scope.collection (BaseTopic, AcquisitionTimestamp) 
PARTITION BY HASH (BaseTopic)

For AcquisitionTimestamp is PARTITION , it must be immutable. Also most queries does range scan, So it can’t do partition elimination(must do scatter gather). Only advantage is if index doesn’t fit one node or partition entries are skewed (i.e. one partition has lot and others less).

Another option remodel bucket to collection and each collection store specific BaseTopic and avoid BaseTopic in document and index which can save more

As @mreiche mentioned if you only query one day you can use SUBSTR(AcquisitionTimestamp,0,10) in index and query or DATE_STR_FORMAT() .
Best option for not changing queries is create new index and drop old index afterwords you will see immediate impact.

fabiosst · July 16, 2024, 4:04pm

Hi @vsr1, thanks for your suggestion.

I tried on 10M documents, but I’m seeing no performance improvements, even adding the BaseTopic as first index field.
I also tried to partition by “BaseTopic,AcquisitionTimestamp” but didn’t work.

Additionally, I tried to substitute each reference to AcquisitionTimestamp with SUBSTR(AcquisitionTimestamp,0,10) but didn’t get the wanted results.
I partitioned the index also with SUBSTR(AcquisitionTimestamp,0,10) and queried with SUBSTR(AcquisitionTimestamp,0,10) = “YYYY-MM-DD” but nothing improved.

vsr1 · July 16, 2024, 5:08pm

Follow this and see where the time spent

system · October 14, 2024, 5:08pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Considerations while creating index SQL++ n1ql , index	3	1114	February 14, 2018
Indexes in Analytics using N1QL query for Optimize Analytics n1ql	3	995	January 11, 2022
GROUP BY performance SQL++ java , n1ql	13	3286	July 13, 2017
Initial index performance SQL++ query , n1ql , index	10	2134	May 8, 2017
Designing a multi-purpose index SQL++ n1ql	5	805	August 8, 2018

Index suggestion

Related topics