We’ve struggled for about a year now to get our CB 4.6 cluster to really perform on the Azure platform.
This is an issue we’ve put a fair amount of time & effort into. It’s gotten better over the past year with some investments in Azure High Scale VMs, but still doesn’t seem anywhere close to Azure’s High Scale perf targets.
We have a three-node cluster of DS3_V2 VMs. While DS3s aren’t very “big” VMs, they have pretty high I/O targets when you consider the High Scale VMs’ SSD disks and local-SSD caching.
Here’s a common example for us: we regularly get “index scan timed out” errors for larger aggregate queries against our primary bucket. The bucket is only 72GB, and each node has SSD storage capable of 2,300 IOPS.
Furthermore, specifically for the index scan timeout, it’s my understanding that Couchbase scans the index and writes some temporary results to a file called the “scan-backfill” in /tmp. In our cluster, /tmp is mapped to a separate SSD disk that is backed by a local SSD cache capable of 21,000 IOPS (so roughly 9x faster than the bucket disks).
So overall we’re struggling to understand why we’re seeing index scan timeouts when the storage behind our cluster appears to be some of the fastest cloud storage available.
Is there some way to get more detailed information on what the bottleneck is for our index scans?
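One generic check I can think of (nothing Couchbase-specific) is to sample Linux’s /proc/diskstats while a slow query runs and compare the observed IOPS per device with the 2,300 and 21,000 figures above. Here is a minimal sketch in Python, assuming the nodes run Linux; the function names are my own placeholders, and it only shows raw disk activity, not what the indexer is doing:

```python
import time


def parse_diskstats(text):
    """Return {device: (reads_completed, writes_completed)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        # Field layout: major, minor, device name, reads completed, reads merged,
        # sectors read, ms spent reading, writes completed, ...
        stats[fields[2]] = (int(fields[3]), int(fields[7]))
    return stats


def iops_between(before, after, seconds):
    """Compute per-device (read IOPS, write IOPS) between two samples."""
    result = {}
    for dev, (r1, w1) in after.items():
        if dev in before:
            r0, w0 = before[dev]
            result[dev] = ((r1 - r0) / seconds, (w1 - w0) / seconds)
    return result


def sample(interval=5.0, path="/proc/diskstats"):
    """Take two samples `interval` seconds apart and return IOPS per device."""
    with open(path) as f:
        before = parse_diskstats(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_diskstats(f.read())
    return iops_between(before, after, interval)
```

Calling `sample()` on a node during one of the large aggregate queries would show whether either disk gets anywhere near its rated IOPS. If both stay far below, the bottleneck is presumably somewhere other than raw storage throughput.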
Also, is it possible that Couchbase is doing SYNC writes to the scan-backfill instead of ASYNC writes? If so, it’s my understanding that this has some pretty major performance implications (since a SYNC write usually isn’t necessary for a temporary file).
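To get a feel for how large the sync-versus-buffered gap is on the /tmp disk itself, something like the following rough Python sketch could be run on a node. It does not tell us what Couchbase actually does; it only measures the cost of calling fsync after every write compared with letting the OS buffer the writes. The function names, block size, and write count are arbitrary choices of mine:

```python
import os
import tempfile
import time


def time_writes(directory, count=200, size=4096, sync=False):
    """Write `count` blocks of `size` bytes to a temp file in `directory`.

    With sync=True, fsync after every block (forcing it to stable storage);
    otherwise let the OS page cache buffer the writes.
    Returns the elapsed time in seconds.
    """
    block = b"\0" * size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
            if sync:
                os.fsync(fd)
        return time.perf_counter() - start
    finally:
        # Always clean up the temporary file, even if a write fails.
        os.close(fd)
        os.remove(path)


def compare(directory="/tmp"):
    """Return (buffered_seconds, synced_seconds) for the same workload."""
    buffered = time_writes(directory, sync=False)
    synced = time_writes(directory, sync=True)
    return buffered, synced
```

If `compare("/tmp")` shows the synced run taking far longer than the buffered one, that would at least confirm that sync writes to the backfill file could plausibly account for a lot of lost time on this storage.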