I did some research on the forums and there are plenty of posts about the “Index scan time out” error message, but the suggestions are mostly focused on changes to the index to make it return in less than two minutes. We can definitely go that route, but I have some other questions before doing so:
Is there a way to simply increase the two-minute limit if we know we have some large indexes?
We assumed it was “cheaper” for Couchbase to maintain one large index instead of many smaller indexes. It’s definitely “cheaper” from a development perspective in that ideally we don’t have to create new indexes when on-boarding new clients.
We have a covering index that includes some pretty small data fields, but the index is HUGE. Here are some details:
Total documents in bucket: 14M
Total bucket data size: 17GB
Size of index: 9GB (size on disk when viewing the index file)
Num of docs in index: ~5M
We’re only pulling three small properties out, which average 60 characters altogether. So the question is: how do 60 characters of data across 5M docs get blown up into a 9GB index?
I also posted the following question, which we suspect is also contributing to hitting the two-minute limit.
To answer the question of “why” we have these large indexes, it’s mostly for aggregate queries. Here’s another example of a query that hits the two-minute limit but that you would expect to be pretty quick over an indexed 14M-doc dataset:
SELECT SUBSTR(meta().id,0,POSITION(meta().id,"-")), COUNT(meta().id)
FROM cloud_med
GROUP BY SUBSTR(meta().id,0,POSITION(meta().id,"-"))
We format our document keys into collections of sorts like this: type-client_name-unique_id. So the query above is handy to report on what “type” of data is contributing to the total document count. But this query almost always hits the two-minute index scan limit against the primary index.
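For illustration, here’s a rough Python sketch of what that expression extracts, using a made-up sample key (not a real key from our data):

# Rough equivalent of SUBSTR(meta().id, 0, POSITION(meta().id, "-"))
# for a key shaped like type-client_name-unique_id; the sample key is invented.
doc_id = "invoice-acme-0001"
doc_type = doc_id[0:doc_id.index("-")]  # everything before the first dash
print(doc_type)  # -> invoice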
Anyone have any ideas on this? I’ve seen some posts that talk about overriding the timeout values using the REST API, but I’m not sure whether those timeout values apply to the index scans.
SELECT SUBSTR(meta().id,0,POSITION(meta().id,"-")), COUNT(meta().id)
FROM cloud_med
GROUP BY SUBSTR(meta().id,0,POSITION(meta().id,"-"))
To do the grouping over the entire data set, the query needs to read the whole index and then do the grouping.
There are a couple of optimizations in CB 4.5 that you can exploit by writing a small script. Say the query you want is:
SELECT phylum, COUNT(1) FROM bucket GROUP BY phylum;
Step 1: SELECT MIN(phylum) FROM bucket WHERE type = 'blah';
Step 2: SELECT COUNT(1) FROM bucket WHERE <full condition> AND phylum = <the value from step 1>;
Step 3: SELECT MIN(phylum) FROM bucket WHERE type = 'blah' AND phylum > <the previous value>;
Step 4: Repeat steps 2 and 3 until MIN(phylum) no longer returns a value.
Now you have the full result with the least processing.
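A minimal sketch of that script in Python, going through the query service REST endpoint (port 8093). The host, credentials, bucket name, and the phylum field are placeholders rather than anything from this thread, and it assumes the group values are strings; AND in your own WHERE predicate if the real query has one.

# Drive the MIN()/COUNT() loop described above through the N1QL REST API.
import requests

QUERY_URL = "http://localhost:8093/query/service"  # query service endpoint
AUTH = ("Administrator", "password")               # placeholder credentials

def run(statement):
    """POST a N1QL statement and return its 'results' array."""
    resp = requests.post(QUERY_URL, data={"statement": statement}, auth=AUTH)
    resp.raise_for_status()
    return resp.json()["results"]

counts = {}

# Step 1: the smallest group value (index-order MIN() lookup in 4.5)
current = run("SELECT MIN(phylum) AS k FROM `bucket`")[0].get("k")

while current is not None:
    # Step 2: count only this group (equality predicate, indexed COUNT())
    counts[current] = run(
        "SELECT COUNT(1) AS c FROM `bucket` WHERE phylum = '%s'" % current
    )[0]["c"]
    # Step 3: the next group value strictly greater than the current one
    current = run(
        "SELECT MIN(phylum) AS k FROM `bucket` WHERE phylum > '%s'" % current
    )[0].get("k")
    # Step 4: the loop ends when MIN() no longer returns a value

print(counts)  # {group value: document count}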
–
CB 4.5 optimizes some COUNT() queries and does a quick lookup of MIN() using the index order.
The index scan timeout can be increased using the setting change below. The “ip” needs to be that of the node where the index service is running.
curl -X POST -u <user:pass> http://<ip>:9102/settings --data '{"indexer.settings.scan_timeout": <timeout_in_milliseconds>}'
But before changing the timeout, it is better to try to improve the query/index definition (see the next point).
If your queries do not have a WHERE clause (e.g. the one mentioned in this post), smaller indexes are better; otherwise the full large index is going to be scanned for every query, which is not a good idea.
Also, if the small indexes are defined with a WHERE clause and the same predicate is used in the query, N1QL only needs to work with the small index and performance will be much better.
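For example (a hypothetical sketch: the client/status fields and the index name are invented here, not taken from the post), a small partial index plus a query that repeats its predicate could look like this, reusing the same REST helper as above:

# Create a small partial index and run a query whose predicate matches it,
# so only the small index needs to be scanned.
import requests

QUERY_URL = "http://localhost:8093/query/service"
AUTH = ("Administrator", "password")

def run(statement):
    resp = requests.post(QUERY_URL, data={"statement": statement}, auth=AUTH)
    resp.raise_for_status()
    return resp.json()

# A small index restricted to one client's documents
run("CREATE INDEX idx_acme_status ON cloud_med(status) WHERE client = 'acme'")

# The query repeats the index's WHERE predicate, so N1QL can use idx_acme_status
print(run("SELECT status, COUNT(1) AS c FROM cloud_med "
          "WHERE client = 'acme' GROUP BY status")["results"])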
Along with the indexed fields, the documentId is stored as part of the index storage. There is also some write amplification in the storage, as it needs to maintain the MVCC snapshots. What is the amount of fragmentation being reported for the index? If there is high fragmentation, disk usage would come down after a compaction run.
Can you kindly elaborate on what you mean by the curl request not working? Is the request not going through, or was the request successful but the indexer did not honour the scan timeout setting? Would it be possible to share the logs for further analysis?