No matter how we slice it, cbbackup is pretty slow. We have a support contract and have spent many a precious hour with Enterprise Support trying to make it faster. We know it’s not directly hardware related since XDCR to the exact same machine we’re using as the cbbackup destination is very fast. Although it’s possible that cbbackup and XDCR use the underlying storage in different ways.
For example, the backup target is able to flush XDCR mutations to disk at 5K-10K docs/sec, but a cbbackup job can barely achieve a tenth of that, even using the “cbbackupwrapper” functionality with the settings recommended by support.
So I’d like to move on from cbbackup if possible. Our core requirement is to take backups at standard intervals (daily/weekly/monthly) so we can recover from an undetected data issue (e.g. a process that is incorrectly updating data). Does anyone have suggestions on this? Is anyone successfully using cbbackup for point-in-time backups and achieving decent performance?
In Couchbase 4.5 we shipped cbbackupmgr as our new enterprise backup tool, and it has better performance than both cbbackup and cbbackupwrapper. It is currently only tested and supported on Couchbase 4.5, so it may or may not be useful for you. It should work on Couchbase 3.0+, but we haven’t tested all of the different scenarios yet, which is why we don’t officially recommend it to users on non-4.5 clusters.
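In case it helps, the basic flow is to create a backup archive and repository once, then run backups into it. A rough sketch (the archive path, repo name, cluster address and credentials are just placeholders for your environment):

cbbackupmgr config --archive /data/backups --repo prod
cbbackupmgr backup --archive /data/backups --repo prod --cluster couchbase://YOUR_NODE --username Administrator --password YOUR_PASSWORD

Subsequent runs against the same repo are incremental, and restores read from the same archive with cbbackupmgr restore.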
Also, you mentioned having issues with cbbackupwrapper. I’ve run various performance tests and seen speeds of up to 100MB/sec. In those tests we write to SSDs, and your setup may be different. Can you provide some information on what type of hardware you’re running cbbackupwrapper on so I can reproduce the performance issues you are seeing?
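For reference, the kind of invocation I use in those tests looks roughly like the one below. The host, backup directory and parallelism value are placeholders, and it’s worth double-checking the option names against cbbackupwrapper --help on your version; the parallelism setting (how many backup processes run at once) is usually the one that matters most:

cbbackupwrapper http://YOUR_NODE:8091 D:\backups -u Administrator -p YOUR_PASSWORD -P 16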
Thanks for the reply Mike. We are running on Azure, and currently have our Couchbase data files on a 4-disk RAID 0 of page-blob Azure disks. We recently upgraded our backup node to 4.5, so I took advantage of this to try your new command. Here is what we ran:
Backing up to 2016-06-24T17_50_18.8495694Z
Copying at 3.08MB (about 2h 48m 36s remaining) 894490 items / 1.35GB
cloud_med [==== ] 5.55%
I think the problem is that the backup software isn’t using asynchronous I/O, which completely kills performance on page-blob storage. Notice the disk queue on the D: and E: drives:
You can tell that it’s waiting for a successful write before it queues up the next read, which seems strange for backup software.
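(If anyone wants to watch the same disk queue length from a console instead of perfmon, something along these lines works in PowerShell; the sample interval is arbitrary:

Get-Counter -Counter '\PhysicalDisk(*)\Avg. Disk Queue Length' -SampleInterval 1 -Continuous)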
We don’t use async I/O libraries in cbbackupmgr right now, but we do have multiple writers. I looked into how the page blob store works on Azure, and one thing I noticed is that it automatically calls fsync after every write. Also, page blobs are often network attached, which means the cost of many writes with an fsync after each can be large. I’m not sure whether the setup you are using is local or remote, but in any case I think the automatic fsyncs may be what is causing the slowness.
The way the writes currently work is to batch up as much as 2GB of data and then flush it all to disk. That flush may consist of many separate writes, but we don’t explicitly call fsync until after the last write. To investigate this issue further I have filed MB-20048. I would like to reproduce your exact setup, so if there is anything else you can add, please do.
Thank you, I really appreciate the detailed reply. I’ll collect the details of our configuration (including specifics on the page blob disks) and post them tomorrow.
One other item to note - I stopped the Couchbase service on our “backup” node (an XDCR target from Prod) and did a basic file copy of the Couchbase /data folder. The entire folder copied in about 30 mins at around 8 MB/sec, and disk queue length bounced between 4 and 10 for the disk hosting the data files.
I’m assuming this is more about the backup destination disks than it is about our cluster (since I’m assuming the cluster should not be the bottleneck in this scenario?). Here are the specs on our destination server:
Azure Standard A3 VM - 4 Cores, 7GB Memory
Per Microsoft’s recommendation for best disk performance, we have a single logical volume striped across 4x Azure Page Blob Disks. The destination disk for our backups is CB_XDCR (E:). Note that we also have XDCR data stored on this volume, but it is not related to the cbbackup backup sets.
I believe you already mentioned that the bottleneck was the backup file writes, but wanted to let you know that we confirmed this in our environment as well.
We upgraded our backup target to a D3 VM on Azure, which includes local SSD. With the backup archive on local SSD, the backup took 30 mins. We then used regular Windows file copy commands to move the files to slow page-blob storage.
So in total: 30 mins from Couchbase to local SSD, then 10 mins from local SSD to page-blob, compared to around 3 hrs when cbbackupmgr was writing directly to page-blob.
What was the average backup speed when you ran on SSDs? I would expect that you can get more than a 6x improvement in your use case. Also, did you use the --threads option to increase parallelism?
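If not, it’s worth trying. The archive path, repo name and cluster details below are just placeholders, but the idea is simply to add --threads to the backup command, for example:

cbbackupmgr backup --archive E:\backups --repo prod --cluster couchbase://YOUR_NODE --username Administrator --password YOUR_PASSWORD --threads 16

A reasonable starting point might be a thread count around the number of cores on the backup machine, then adjust from there.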