I’m running Couchbase 4.1.1-5914 Community Edition (build-5914) and using XDCR to sync with an elasticsearch cluster.
Just as it gets close to being 100% done syncing, it resets down to 75% or so and begins climbing again.
Nothing special in the logs, and the server is otherwise behaving normally. It just seems as though a rather massive number of documents keep re-syncing.
A quick small screenshot of what I’m seeing in the web interface:
Any thoughts here on where I could look to try to diagnose? I’m hoping this is just something obvious someone else has seen.
A bit more history:
It had been working as expected; then one of the CB servers in the cluster failed (not really an issue, since we have enough redundancy in the network plus auto-failover) and suddenly the XDCR started going a bit crazy. We brought the failed node back up, and since then it’s been stuck in that sawtooth behavior.
This happened once before, and we solved it by completely removing the XDCR configuration and syncing from scratch with elasticsearch. It was then fine until the same thing happened again: a node failed (for as-yet unknown reasons) and we were back to this weird behavior.
It’s not critical, since everything is still working fine and ES gets updated pretty quickly with new changes, but it causes far too much load until we create a new ES index and sync everything from scratch. Clearly something funky is going on; hopefully some fresh eyes will have an idea.
As was the case before, we created a new elasticsearch index and set up a second replication to it. Once that was done, we switched over to the new index and deleted the replication to the old index, and it’s no longer engaging in the nonsense it was before.
I’d love to have some idea how to prevent it from doing that again if we ever need to take a server in the cluster offline. Having to completely reindex elasticsearch from scratch every time the cluster size changes seems… well… less than ideal.
Hi Courtney,
This may be a bug. The stat “mutations_failed_resolution” (renamed “mutations skipped by resolution” in 4.5) indicates that XDCR found that the items were already replicated to Elasticsearch so it didn’t need to send them. However, the odd thing is that the mutations aren’t dropped from the XDCR replication queue once that check is made. Were all of these cases preceded by a failure of a Couchbase Server node, or was there any other pattern that seems significant?
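If it’s useful, that stat can also be watched outside the web UI through the per-replication stats REST endpoint. Here is a rough sketch in Python; the host, credentials, bucket names, and remote-cluster UUID are placeholders, and the assumption that the UI stat corresponds to the REST stat docs_failed_cr_source should be verified against your version:

    # Rough sketch: poll one XDCR per-replication stat over the Couchbase REST API.
    # All values below are placeholders; docs_failed_cr_source is assumed to be the
    # REST counterpart of the "mutations failed resolution" stat shown in the UI.
    import requests
    from urllib.parse import quote

    HOST = "http://127.0.0.1:8091"            # any node in the source cluster
    AUTH = ("Administrator", "password")      # admin credentials (placeholder)
    SOURCE_BUCKET = "default"                 # placeholder bucket names
    TARGET_BUCKET = "es_index"
    REMOTE_UUID = "REMOTE_CLUSTER_UUID"       # from GET /pools/default/remoteClusters

    # Per-replication stats are keyed as
    # "replications/<remote uuid>/<source bucket>/<target bucket>/<stat name>".
    stat_key = "replications/{}/{}/{}/docs_failed_cr_source".format(
        REMOTE_UUID, SOURCE_BUCKET, TARGET_BUCKET)
    url = "{}/pools/default/buckets/{}/stats/{}".format(
        HOST, SOURCE_BUCKET, quote(stat_key, safe=""))

    resp = requests.get(url, auth=AUTH)
    resp.raise_for_status()
    # The response layout can vary by version; recent samples usually appear
    # per node under "nodeStats".
    for node, samples in resp.json().get("nodeStats", {}).items():
        print(node, samples[-1] if samples else "n/a")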
-Will
Yes, it’s always preceded by the cluster size changing. It seems a rebalance from either adding or removing a node will make this occur. Early on we assumed this was a sizing problem, so we kept increasing the cluster sizes on both sides, but that clearly wasn’t the culprit.
Hi Courtney,
This is Yu, a developer on the XDCR team. It is expected that replication will restart when the cluster size changes. It should not keep restarting after the cluster size change completes, though. Can you please attach the goxdcr.log from when the problem occurred?
We’ve been trying to avoid reindexing, to see if there’s a way we can get it to settle down without having to completely reindex every time the cluster size changes. After doing some rolling restarts of the nodes, it’s settled on this behavior:
This is a 1-hour timescale.
The data is not heavily updated, so there’s no reason why it should be doing whatever the heck it’s doing. There’s maybe a few dozen sets per second. It’s mostly heavy on the read side.
I was going to attach a gzipped log from today with the above (not knowing what parts of the log may be helpful for you), but uploading gzip files is not allowed here. Perhaps I could email it?
Hi Courtney - there’s a built-in function in the Couchbase Server web admin that uploads zipped logs to a secure S3 bucket at Couchbase. If that’s permitted, you can find that option under Log - Collect Info (substitute your IP here:
http://127.0.0.1:8091/ui/index.html#/logs/collectInfo/form )
If you use that, you can add the issue number MB-21927.
-Will
What should I input on the “Upload to host:” field?
You can put in the following:
s3.amazonaws.com/cb-customers
Customer name ‘courtney’, or if you put in something else, please update the ticket so we know what it is.
It only allowed numeric values for the ticket number so I just used 21927.
It’s up there now.
Let me know if you need anything else. In the meantime, the elasticsearch cluster is able to handle the unnecessary load, and we’re refraining from re-indexing from scratch in the hope of finding a better solution. Since it’s still replicating to ES pretty quickly, we can live with it while we hunt for one.
Just deleting the es index, deleting the xdcr config and starting from scratch definitely fixes it though.
Can you email goxdcr.log to me at yu@couchbase.com?
Emailed that in addition to the “collect information” option.
In case this is helpful as well
It keeps repeating that behavior from 1:45pm - approx 2:20pm.
Courtney,
From the log file, replication kept restarting because of a malformed response error, which is the same issue as the one in MB-20937. Unfortunately, MB-20937 was not fixed until 4.5.1.
This issue typically shows up when mutations are sent to the target in large batches. It probably did not show up before the cluster resizing because the incoming data rate was low and the mutation batches were therefore small. After the resizing, checkpoint documents were lost for the vbuckets that were moved, and the mutations in those vbuckets needed to be re-replicated. This resulted in large replication batches, which triggered the malformed response errors.
One way to fix this is to upgrade to 4.5.1. A workaround is to reduce the “XDCR batch size” setting to make the replication batches smaller. From the log file, the average document size is ~500 bytes, so if we reduce “XDCR batch size” from the default 2048 KB to 50 KB, each batch would hold no more than ~100 documents, compared to up to 500 documents by default. If needed, the batch size can be reduced further, to as little as 10 KB.
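If it’s easier to script this than to change it in the web UI, the same setting can be applied per replication over the REST API. Here is a minimal sketch in Python, assuming the “XDCR batch size” field maps to the goxdcr parameter docBatchSizeKb, with placeholder host, credentials, bucket names, and remote-cluster UUID:

    # Minimal sketch: lower the per-replication batch size setting via REST.
    # All names/values below are placeholders to adapt; docBatchSizeKb is assumed
    # to be the REST name of the "XDCR batch size" setting in the admin console.
    import requests
    from urllib.parse import quote

    HOST = "http://127.0.0.1:8091"            # any node in the source cluster
    AUTH = ("Administrator", "password")      # admin credentials (placeholder)
    SOURCE_BUCKET = "default"                 # placeholder bucket names
    TARGET_BUCKET = "es_index"
    REMOTE_UUID = "REMOTE_CLUSTER_UUID"       # from GET /pools/default/remoteClusters

    # Per-replication settings are addressed by a URL-encoded id of the form
    # <remote uuid>/<source bucket>/<target bucket>.
    replication_id = quote(
        "{}/{}/{}".format(REMOTE_UUID, SOURCE_BUCKET, TARGET_BUCKET), safe="")

    resp = requests.post(
        "{}/settings/replications/{}".format(HOST, replication_id),
        auth=AUTH,
        data={"docBatchSizeKb": 50},          # default is 2048; 50 per the workaround above
    )
    resp.raise_for_status()
    print(resp.json())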
It seems that there is no Community Edition of 4.5.1; an upgrade to 4.6 would be required.
@courtney
This has nothing in particular to do with your XDCR issue, but I suggest you review your Couchbase configuration. Check the recommendations outlined here for THP, swap space, and swappiness: http://developer.couchbase.com/documentation/server/4.5/install/install-linux.html
There is evidence that the OOM killer was active on at least one of your nodes. That may be related to those settings. Couchbase Support would have more insight than I do, but the OOM killer activity might be what caused your original node failures. You might want to look into your logs further. Whatever the root cause of the node failures, the configuration recommendations above are a good idea.
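For a quick per-node sanity check of those two settings, something like the sketch below could be run on each box; the sysfs/proc paths are the standard Linux ones, and the target values (THP disabled, swappiness 0) come from the linked documentation:

    # Rough sketch: check Transparent Huge Pages and vm.swappiness on a node.
    # Paths are the standard Linux locations; RHEL 6 kernels use the
    # redhat_transparent_hugepage variant. Target values per the linked docs.
    from pathlib import Path

    THP_PATHS = [
        "/sys/kernel/mm/transparent_hugepage/enabled",
        "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
    ]

    def read_first(paths):
        for p in paths:
            f = Path(p)
            if f.exists():
                return f.read_text().strip()
        return None

    thp = read_first(THP_PATHS)                      # e.g. "always madvise [never]"
    swappiness = Path("/proc/sys/vm/swappiness").read_text().strip()

    # The kernel brackets the active THP mode, so "[never]" means THP is off.
    print("THP:", thp, "OK" if thp and "[never]" in thp else "-> should be [never]")
    print("swappiness:", swappiness, "OK" if swappiness == "0" else "-> docs recommend 0")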
-Will
Missed that THP option, but had all the swap config. Not sure how we missed that!
Anyway, less than 50% of the memory on each node is allocated to Couchbase, and it still seemed to be eating up memory. XDCR seemed to use more memory than expected, so we’ve been cutting back the memory allocated to Couchbase incrementally. I was kind of thinking that the out-of-control XDCR memory usage had to do with whatever replication issue was happening, which ultimately led to a node using too much memory. goxdcr was eating up something like 12 GB of memory on each node at one point.
In any case, the workaround noted above on the issue seems to at least cut down on the ES load. It still never fully syncs and definitely still isn’t quite working, but it’s more tolerable (we set the batch size to 50 KB).
I don’t think using the 4.6 developer preview to fix this would be wise. The last time we tried a developer preview (4.5), it performed quite poorly.