Hello. I gave up trying to solve my issue myself, and I didn't find it described elsewhere.
My use case:
I intended to use Couchbase as a very simple storage for a lot of data. It's my secondary storage; the real load is elsewhere. I have two buckets. One is big, with almost 1 billion documents that are stored once and never edited. The second one is smaller (under 100 million documents) and updates are possible. I have no secondary indexes, so it's basically a key-value store, although the documents are structured and have approximately 15 attributes. My aim is a few dozen writes per second 24/7 and occasional reads (far fewer than the writes). The active dataset should be tiny (well under 1%). It's something like a time series, although not 100%. I don't mind cache misses on reads; I'm more concerned about stable writes. I need the cluster to be as maintenance-free as possible - I want to monitor the beast and add a new node when the time comes.
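To give an idea of the write pattern, it is roughly this (a minimal sketch using the Python SDK; the host, bucket and attribute names are made up for illustration, the real documents have about 15 attributes):

```python
# Minimal sketch of the write-once workload; names are placeholders.
from couchbase.bucket import Bucket

bucket = Bucket('couchbase://cb-node1/events')  # hypothetical node / bucket name

def store_event(key, payload):
    # Documents in the big bucket are written once and never edited,
    # so a plain upsert is all that happens.
    bucket.upsert(key, payload)

store_event('event::2016-05-10::0001', {
    'source': 'sensor-42',          # illustrative attributes; the real
    'ts': '2016-05-10T12:00:00Z',   # documents have ~15 of them
    'value': 3.14,
})
```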
Cluster setup:
I have 4 identical VM nodes dedicated to Couchbase only:
- 4 cores
- 4 GB RAM
- 1 TB HDD
- Ubuntu 14.04.4 LTS
- nothing unusual
Couchbase server 4.0.0 CE
Data RAM quota: 2048 MB
Index RAM quota: 1024 MB
The Data RAM quota is split evenly between the two buckets (if I understand this right).
Both buckets use full ejection and have 2 replicas.
Both data and index are on the same partition (I know it’s not recommended).
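Just to make the bucket configuration explicit, it corresponds to something like this through the REST API (a sketch only; host, credentials and bucket names are placeholders, and the 1024 MB per bucket reflects my understanding of the even quota split):

```python
# Sketch of the bucket configuration via POST /pools/default/buckets.
# Credentials, host and bucket names are placeholders.
import requests

ADMIN = ('Administrator', 'password')   # placeholder credentials
NODE = 'http://cb-node1:8091'

for name, quota_mb in [('big-bucket', 1024), ('small-bucket', 1024)]:
    resp = requests.post(
        NODE + '/pools/default/buckets',
        auth=ADMIN,
        data={
            'name': name,
            'bucketType': 'couchbase',
            'ramQuotaMB': quota_mb,          # per-node quota, split evenly
            'replicaNumber': 2,              # 2 replicas on both buckets
            'evictionPolicy': 'fullEviction', # full ejection
            'authType': 'sasl',
            'saslPassword': '',
        })
    resp.raise_for_status()
```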
The problem:
The data caused the cluster to become more fragmented than I expected during the pre-production fill-up, the cluster completely failed to send e-mail alerts (I believe this bug has been around at least since 2.0, but never mind that now), and the disks got full. I added another node (I had three originally), ran a rebalance (OK) and a compaction (OK). But during the compaction the cluster began to use more and more RAM; in the end it reported over 1 GB of over-used RAM. I stopped all traffic to the cluster the moment I started the rebalance. The cluster didn't work well during the rebalance anyway (a lot of time-outs), and it accepted virtually no writes during the compaction. But never mind that as well.

Since then the cluster has been stuck in severe RAM overuse and refuses all reads and writes, reporting an out-of-memory error. I tried a series of 1000 writes and 1000 reads separated by 1 second - just to nudge the cluster gently into some action. But no ejection took place and the cluster was still as good as dead.
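The nudge test was roughly the following (a sketch with the Python SDK; connection string and key names are illustrative):

```python
# Rough sketch of the "nudge" test: 1000 writes and 1000 reads,
# one per second, counting the failures the cluster kept returning.
import time
from couchbase.bucket import Bucket
from couchbase.exceptions import CouchbaseError, NotFoundError

bucket = Bucket('couchbase://cb-node1/events')  # placeholder connection string

write_failures = read_failures = 0

for i in range(1000):
    try:
        bucket.upsert('nudge::%d' % i, {'i': i})
    except CouchbaseError:
        write_failures += 1   # came back as an out-of-memory style error
    time.sleep(1)

for i in range(1000):
    try:
        bucket.get('nudge::%d' % i)
    except NotFoundError:
        pass                  # the corresponding write may have failed
    except CouchbaseError:
        read_failures += 1
    time.sleep(1)

print('failed writes: %d, failed reads: %d' % (write_failures, read_failures))
```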
Something from the OS:
- memcached process was taking about 2.1 to 2.6 GB of RAM (well over the quota)
- beam process was taking about 0.5 GB RAM
- no other significant memory consumers, no significant CPU usage
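(The per-process numbers above were sampled roughly like this; a Linux-only sketch, with the process names as shown by ps:)

```python
# Sample the resident set size of the memcached and Erlang (beam.smp) processes.
import subprocess

for proc in ('memcached', 'beam.smp'):
    out = subprocess.check_output(['ps', '-C', proc, '-o', 'rss=']).decode().split()
    total_mb = sum(int(rss) for rss in out) / 1024.0   # ps reports RSS in KB
    print('%s: %.1f MB resident' % (proc, total_mb))
```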
In the end I restarted the couchbase service on one of the nodes; the cluster is rebalancing at the moment, and I expect that by restarting all the nodes sequentially the RAM will eventually be freed and the cluster will be operational again. I think it's clear that having a completely dead cluster after each compaction is unacceptable. Am I doing something wrong? Is Couchbase able to fulfill my use case?
Advice would be dearly appreciated. I'd like to keep Couchbase because I've already invested a lot of work into it, but unless I solve this issue I will have to use another storage.
EDIT: all queues were empty, there was no traffic and no leftovers of any sort that I would recognize or notice. Everything seemed perfectly calm and all right except for the memory overuse and the fact that the cluster didn't work at all. I have screenshots from the admin interface, but the forum does not let me upload more than one image, so ask away if you want more information.