Index creation is slow on a huge number of documents

I am creating two indexes on a Couchbase cluster.

One is the primary index, the other is a secondary index. I am creating these indexes on approximately 30 billion documents. The secondary index covers 3 elements.

I started creating them 7 hours ago, but the current progress is only 35 and 39 percent.
Does it usually take this long to create indexes on such a large data set, or is something wrong with my environment?
When do you think the creation will finish?
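
As a rough way to reason about the finish time, the figures above (7 hours elapsed, 35 and 39 percent done) can be extrapolated linearly. This is only a sketch that assumes the build rate stays constant, which may not hold for a real index build; the function name `estimate_remaining_hours` is made up for illustration.

```python
def estimate_remaining_hours(elapsed_hours, percent_done):
    """Linear extrapolation: assumes the build keeps its average rate so far."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be in (0, 100]")
    # Projected total duration at the observed average rate.
    total_hours = elapsed_hours * 100.0 / percent_done
    return total_hours - elapsed_hours


# Progress figures taken from the post: 7 hours elapsed, 35% and 39% done.
for percent in (35, 39):
    remaining = estimate_remaining_hours(7, percent)
    print(f"{percent}% after 7 h -> about {remaining:.1f} h remaining")
```

Under that assumption the two builds would need roughly 13 and 11 more hours, so around 18 to 20 hours in total.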

The cluster has 4 nodes (16 cores each), and the total index RAM quota is 10 GB. 2 of the nodes are index servers.

The index settings are as follows:

Indexer Threads: 8
In Memory Snapshot Interval: 200 ms
Stable Snapshot Interval: 5000 ms
Max Rollback Points: 5
Indexer Log Level: info

Thanks

Index build times can be high for a few reasons:

  • Retrieval of the information from the data service is slow.
  • The index nodes can’t save the index to disk fast enough.

There are a few options:

  • Use the defer_build option to build both indexes together. A deferred build ensures that you scan the data once and build both indexes from that single scan.
  • You could also partition your indexes and add more nodes to parallelize the index build. For partitioning, you can specify a filter (a WHERE clause in CREATE INDEX). However, I should note that some queries may not be able to take advantage of range scans on a partitioned index.
  • Last, we have another option in 4.5 called memory optimized indexes, which can build the index much faster in memory. However, given the number of docs, I don’t think you will be able to fit your index into memory.
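
To make the first two options concrete, here is a minimal sketch that only generates the N1QL statements as strings; it does not talk to a cluster. The bucket name `docs`, the index names, the fields `a`, `b`, `c`, and the `type` filter are all hypothetical placeholders, not taken from the original setup.

```python
import json


def create_index(name, bucket, fields=None, where=None, defer=True):
    """Return a N1QL CREATE INDEX statement; fields=None means a primary index."""
    if fields is None:
        if where:
            raise ValueError("a WHERE filter only applies to secondary indexes")
        stmt = f"CREATE PRIMARY INDEX `{name}` ON `{bucket}`"
    else:
        keys = ", ".join(f"`{field}`" for field in fields)
        stmt = f"CREATE INDEX `{name}` ON `{bucket}`({keys})"
    if where:
        # Partial index: only documents matching the filter are indexed.
        stmt += f" WHERE {where}"
    if defer:
        # defer_build registers the index without starting the build.
        stmt += " WITH " + json.dumps({"defer_build": True})
    return stmt + ";"


def build_indexes(bucket, names):
    """Return one BUILD INDEX statement that builds all deferred indexes together."""
    joined = ", ".join(f"`{name}`" for name in names)
    return f"BUILD INDEX ON `{bucket}`({joined});"


# Hypothetical names throughout: bucket `docs`, fields a/b/c, filter on `type`.
statements = [
    create_index("idx_primary", "docs"),
    create_index("idx_abc", "docs", fields=["a", "b", "c"]),
    create_index("idx_abc_orders", "docs", fields=["a", "b", "c"],
                 where='`type` = "order"'),
    build_indexes("docs", ["idx_primary", "idx_abc", "idx_abc_orders"]),
]
print("\n".join(statements))
```

The generated statements could then be run from cbq or the query workbench: the CREATE statements return quickly because the builds are deferred, and the single BUILD INDEX starts them all at once.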

What are the document key size and the index key size? Just curious.
Thanks
-cihan

Hi cihangirb,

Thank you for your reply. The index key size is 45 bytes.
I am using 4 nodes for the cluster; each node has 16 cores and SSD storage on AWS.
I don’t think retrieval or saving is slow, but what do you think?

Thanks

@webber - Did you ever find a solution for this problem?

@eldorado,

Just for your information, the underlying storage engine in use when @webber tried this use case was probably ForestDB (considering that the initial post dates from May ’16). The current underlying storage engine is Plasma, which is very different and performs better than ForestDB.

Thanks,
Varun

@varun.velamuri - Sure… I know Plasma is a better bet than ForestDB, but I was looking for information on what he chose, if he ever resolved the issue. I see a lot of threads left dangling with no solution, so it would be really helpful to close cases with a resolution. But thanks for pointing that out.