I am using Couchbase to load some master data from a SQL Server database. One of the issues I am facing is that fragmentation goes very high after the data is loaded. I am not sure about the specifics of the data load since I do not own that process, but it's something that runs from .NET code. After the load is complete, I see the disk space used by Couchbase go from 21 GB to 154 GB (numbers based on the Web Console, and I assume the Console shows only data size, not indexes).
As soon as the data load is complete, I run compaction on the bucket and it goes back down to roughly the original size. This indicates to me that the data volume is not increasing on a net basis.
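(For reference, I trigger the compaction through the REST API - the bucket name and credentials below are placeholders:)

```
curl -X POST -u Administrator:password \
  http://localhost:8091/pools/default/buckets/master-data/controller/compactBucket
```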
I am looking for suggestions on how to perform the data load into Couchbase efficiently so as to avoid this fragmentation. Are there any guidelines?
I wouldn't necessarily worry about the fragmentation - assuming you have auto-compaction enabled at a suitable threshold, the compaction task will run automatically.
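If it helps, the auto-compaction threshold can be set through the REST API as well as the UI. A rough example (the credentials and the 30% threshold are placeholders; note that `parallelDBAndViewCompaction` is a required parameter):

```
curl -X POST -u Administrator:password \
  http://localhost:8091/controller/setAutoCompaction \
  -d 'databaseFragmentationThreshold[percentage]=30' \
  -d 'parallelDBAndViewCompaction=false'
```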
Having said that, for some context there are two main contributors to fragmentation in Couchstore (given it's an append-only format):
1. Replacing existing documents (as the old document value will still be present earlier in the file).
2. Writing data in small batches (as the B-Tree overhead will dominate if you're only writing a few documents to disk at once).
(2) is mostly a function of your mutation rate compared to your disk speed:
- If your mutation rate is low, or your disks are fast, then little batching occurs (the Data Service optimises for write latency - i.e. it will aggressively flush any outstanding data, even if it's only one item). As such, you'll see higher fragmentation.
- If your mutation rate is high, or your disks are slow(er), then more batching occurs - and hence the B-Tree metadata cost is amortised over more items (and you end up with fewer old/stale B-Tree nodes in the file).
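On (2), one practical implication for your load process is that feeding the cluster a steady stream of bulk mutations (rather than trickling documents in one at a time) lets the Data Service flush more items per disk commit. As a rough sketch of what that might look like with the .NET SDK's bulk overloads - the bucket name, batch size, and data shape here are all made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Couchbase;
using Couchbase.Configuration.Client;

class MasterDataLoader
{
    static void Main()
    {
        var cluster = new Cluster(new ClientConfiguration
        {
            Servers = new List<Uri> { new Uri("http://localhost:8091") }
        });
        var bucket = cluster.OpenBucket("master-data");

        // Stand-in for the existing SQL Server extract: document ID -> value.
        IEnumerable<KeyValuePair<string, object>> rows = LoadRowsFromSqlServer();

        // Upsert in sizeable batches rather than one document at a time, so
        // the server can flush more items per disk commit and amortise the
        // B-Tree metadata overhead across the batch.
        foreach (var batch in rows.Select((kv, i) => new { kv, i })
                                  .GroupBy(x => x.i / 1000, x => x.kv))
        {
            var results = bucket.Upsert(batch.ToDictionary(x => x.Key, x => x.Value));
            foreach (var failure in results.Where(r => !r.Value.Success))
                Console.WriteLine($"Failed: {failure.Key} - {failure.Value.Message}");
        }
    }

    static IEnumerable<KeyValuePair<string, object>> LoadRowsFromSqlServer()
    {
        yield break; // placeholder for the real extract
    }
}
```

The batch size worth using depends on your document sizes and disk speed, so treat 1000 as a starting point to experiment with rather than a recommendation.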
So far I have had a percentage threshold, but compaction was only allowed to run during a specific time interval. I will try auto-compaction as you indicated.
I will also explore the possibility of optimizing the batches with the developers. Is there a way to monitor the metrics you mentioned - mutation rate, disk speed, write latency? I am still new to console monitoring.
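For anyone following along later: from the docs it looks like the `cbstats` tool (in the Couchbase server's bin directory) exposes the disk write queue and operation timings, so I'm planning to start with something like the following - host, bucket, and credentials are placeholders:

```
# Items waiting in the disk write queue
cbstats localhost:11210 -u Administrator -p password -b master-data all | grep ep_queue_size

# Latency histograms, including disk commit times
cbstats localhost:11210 -u Administrator -p password -b master-data timings
```

The Web Console's per-bucket statistics also include a disk write queue graph.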