If possible, I’d like to understand your sizing and I/O requirements. This will help us get to the bottom of the issue.
I suspect this is either a high-ops use case, or data is being generated quickly and ejected frequently because the bucket has limited memory.
To understand this better, though, we should also attribute each error to the component that owns it, since something is not responding properly.
A temporary failure should surface on the client side. Is that where you’re seeing it, in the output of the Node app? It could be related to the metadata issue, but I suspect more is going on, so please keep reading.
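If it is the Node app, I believe the 3.x/4.x SDK reports this as a TemporaryFailureError, which is the server saying it can’t accept the operation right now (often because memory is above the high water mark while ejection catches up). Here’s a minimal retry sketch; the function name, attempt count, and back-off delays are just my own placeholders, not anything the SDK prescribes:

```ts
import { Collection, TemporaryFailureError } from 'couchbase';

// Back off and retry on temporary failures; anything else is rethrown.
async function upsertWithRetry(
  collection: Collection,
  key: string,
  doc: object,
  attempts = 5
): Promise<void> {
  for (let i = 0; i < attempts; i++) {
    try {
      await collection.upsert(key, doc);
      return;
    } catch (err) {
      // A temporary failure usually means the server is briefly unable to
      // take writes, e.g. memory above the high water mark during ejection.
      if (err instanceof TemporaryFailureError && i < attempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** i));
        continue;
      }
      throw err;
    }
  }
}
```

Retrying only papers over the symptom, though; if you see these regularly it points back at sizing.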
Metadata overhead is an issue in the data service itself and is directly impacted by sizing. If you’re not using “Full Ejection”, you’re much more at the mercy of the bucket quota, which is a slice of the total memory quota allocated to the data service. You set an amount of memory for the data service when you set up the nodes, then you created one or more buckets, each with its own memory allocation.
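For reference, you can read the bucket’s current quota, replica count, and ejection policy straight from the SDK. A quick sketch, assuming the Node SDK; the address, credentials, and bucket name are placeholders:

```ts
import { connect } from 'couchbase';

async function showBucketSettings(): Promise<void> {
  // Placeholder address and credentials; substitute your own cluster details.
  const cluster = await connect('couchbase://cb-node-1', {
    username: 'Administrator',
    password: 'password',
  });

  // The bucket manager exposes the settings that matter for sizing.
  const settings = await cluster.buckets().getBucket('my-bucket');
  console.log('Bucket RAM quota (MB):', settings.ramQuotaMB);
  console.log('Replicas:', settings.numReplicas);
  console.log('Ejection policy:', settings.evictionPolicy);

  await cluster.close();
}

showBucketSettings().catch(console.error);
```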
You mentioned using, or wanting to use, 2 replicas as well. Keep in mind that if you have a lot of data, want to keep it in memory for performance, and also hold additional copies of it, there is extra overhead for each replica, because replicas also maintain their data in memory when using “Value Ejection”.
Couchbase replicas ARE NOT like MongoDB replicas. Couchbase only needs 1; it distributes that replica across all the nodes, so every node helps protect against a single failure. This should be good for most use cases.
The trade-off between Value Ejection and Full Ejection for buckets is worth spelling out: Value Ejection is great for high-performance use cases where you have plenty of RAM; Full Ejection is great when you have a large data set and less RAM, but can predict which data will be needed over a given time period (to keep cache misses down) and have good disk I/O.
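If you decide to change the policy (or drop to a single replica), it can be done through the same bucket manager, or through the UI/CLI. A sketch with placeholder names again, and note that as far as I know changing the ejection policy restarts the bucket, so plan for a brief warmup:

```ts
import { connect, EvictionPolicy } from 'couchbase';

async function switchEjectionPolicy(): Promise<void> {
  // Placeholder address and credentials; substitute your own cluster details.
  const cluster = await connect('couchbase://cb-node-1', {
    username: 'Administrator',
    password: 'password',
  });

  const mgr = cluster.buckets();
  const settings = await mgr.getBucket('my-bucket');

  // Switch to Full Ejection and a single replica, then push the modified
  // settings back to the cluster.
  settings.evictionPolicy = EvictionPolicy.FullEviction;
  settings.numReplicas = 1;
  await mgr.updateBucket(settings);

  await cluster.close();
}

switchEjectionPolicy().catch(console.error);
```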
One last item to understand is setting up the cluster properly from an OS perspective. On Linux you’ll want to make sure Transparent Huge Pages (THP) is disabled and that swap is configured for the OS, but with vm.swappiness set to 0 or 1.
So, back to the sizing impact I started off with.
If you could let us know a few things, we might be able to identify the culprit (there’s a rough sizing sketch right after this list):
- Document key and value size (key is the document ID length, value is the JSON body size), and total document count
- Memory and CPU per node
- Number of existing nodes running the data service
- Bucket size and cluster memory quota
- Full Ejection or Value Ejection
- Compaction settings (only if modified from the defaults)
- Operations per second for the bucket(s) in question, and what kind, e.g. 30,000 ops split between 10,000 writes and 20,000 reads
- Cache misses (how many per second?)
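To make the sizing point concrete, here’s a rough back-of-the-envelope sketch of how those numbers combine under Value Ejection. The 56-byte per-document metadata figure and the roughly 85% high water mark are common planning numbers; every input below is a made-up placeholder, so plug in your own:

```ts
// All inputs are placeholders; replace them with your real numbers.
const docCount = 50_000_000;   // total documents in the bucket
const keyBytes = 36;           // average document ID length
const valueBytes = 1_024;      // average JSON body size
const metadataBytes = 56;      // approximate per-document metadata overhead
const replicas = 1;            // copies in addition to the active data
const workingSet = 0.2;        // fraction of values you want resident in RAM
const highWaterMark = 0.85;    // ejection kicks in around this fill level

const copies = 1 + replicas;

// Under Value Ejection every key plus its metadata must stay in RAM...
const metadataRam = (keyBytes + metadataBytes) * docCount * copies;

// ...plus whatever fraction of the values you want resident for cache hits.
const valueRam = valueBytes * docCount * copies * workingSet;

const ramNeededGiB = (metadataRam + valueRam) / highWaterMark / 1024 ** 3;
console.log(`Approximate bucket RAM quota needed: ${ramNeededGiB.toFixed(1)} GiB`);
```

With those placeholder numbers it works out to roughly 32 GiB of bucket quota just to keep keys, metadata, and a 20% working set resident, which is why the real figures matter so much.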
Those are fairly generic pieces of data, and as I described earlier, Couchbase doesn’t require 2 replicas if you have 3 nodes.
One replica is typically sufficient and easier on your I/O budget than 2; it improves the amount of data you can store in memory and on disk, and it reduces the I/O consumed by the persistence operations that write data to disk.
Couchbase is strongly consistent, but the slower your disks are, the longer writes take to reach disk, and that can impact availability simply because data can’t be moved out of RAM fast enough to make space for more data. This is a SIZING issue.
An easy fix for this in AWS, for instance, is to use much larger disks than you strictly need so the disk can keep up with write traffic while EJECTION is happening. Keep in mind that Full Ejection can increase the IOPS needed because of I/O amplification during the ejection process, which maintains the strongly consistent data store Couchbase provides.
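Here’s a similarly rough sanity check on the disk side, again with made-up placeholder numbers, to show how write rate and document size translate into sustained persistence traffic:

```ts
// All inputs are placeholders; replace them with your real numbers.
const writesPerSecond = 10_000; // mutations per second against the bucket
const avgDocBytes = 1_024;      // average document size
const replicas = 1;             // copies in addition to the active data

// Every mutation has to be persisted on the active node and on each replica,
// so the cluster as a whole must absorb roughly this much write traffic,
// before compaction and any Full Ejection re-reads add their own amplification.
const clusterWriteMiBps = (writesPerSecond * avgDocBytes * (1 + replicas)) / 1024 ** 2;
console.log(`~${clusterWriteMiBps.toFixed(1)} MiB/s of sustained persistence traffic across the cluster`);
```

That’s about 19.5 MiB/s in this example, before the extra IOPS from ejection and compaction, which is the kind of headroom you want your disks sized for.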
Let us know how things are going and I hope we can help through the forum.