Consistent XDCR error when replicating to more than 2 clusters in the new 2.2 release

I saw a similar problem when I first tested with the 2.1 release. I’m in the Amazon cloud environment, replicating from one CS 2.2 cluster to several other CS 2.2 clusters using XDCR. When I replicate from 1 cluster to another, I never see an issue. When I replicate from 1 cluster to 2 clusters, I still don’t see an issue. However, when I replicate from 1 to 3 or more clusters, I fairly consistently see the following behavior:

The replication goes fine until, all of a sudden, errors start displaying in the console of the source (my replication is always one way). The replication rate to all destinations then slows down dramatically. Looking at the destination console (for all 3 destinations), there is some activity, then for about 30 seconds there is NO activity, then some activity again (which may last for 30 seconds or so), then no activity again (for about 30 seconds). The end result is that all the documents are eventually replicated successfully, but once I get into this state the transfer rate drops dramatically and performance suffers.

So, I looked in the log files. What I see in the source log file (xdcr_errors.#) is the following:

[xdcr:error,2013-09-21T13:37:43.853,ns_1@machineName.compute-1.amazonaws.com:<0.27099.25>:xdc_vbucket_rep:terminate:398]Replication (XMem mode) ba6ef7badea24f4fcd2e1b9e938b7b99/cust1_app1/cust1_app1 (cust1_app1/250 -> http://*****@ec2-54-246-81-99.eu-west-1.compute.amazonaws.com:8092/cust1_app1%2f250%3b7857d92c94752330262c7567bcf17792) failed. Please see ns_server debug log for complete state dump
[xdcr:error,2013-09-21T13:37:43.858,ns_1@ec2-107-22-165-101.compute-1.amazonaws.com:<0.13879.208>:xdc_vbucket_rep:handle_info:90]Error initializing vb replicator ({init_state,
  {rep,
    <<"ba6ef7badea24f4fcd2e1b9e938b7b99/cust1_app1/cust1_app1">>,
    <<"cust1_app1">>,
    <<"/remoteClusters/ba6ef7badea24f4fcd2e1b9e938b7b99/buckets/cust1_app1">>,
    "xmem",
    [{optimistic_replication_threshold,256},
     {worker_batch_size,500},
     {failure_restart_interval,30},
     {doc_batch_size_kb,2048},
     {checkpoint_interval,1800},
     {max_concurrent_reps,32},
     {connection_timeout,180},
     {worker_processes,4},
     {http_connections,20},
     {retries_per_request,2},
     {xmem_worker,1},
     {enable_pipeline_ops,true},
     {local_conflict_resolution,false},
     {socket_options,[{keepalive,true},{nodelay,false}]},
     {trace_dump_invprob,1000}]},
  250,"xmem",<0.12755.0>,<0.12756.0>,<0.12751.0>}):
{error,
  {badmatch,
    {error,all_nodes_failed,
      <<"Failed to grab remote bucket cust1_app1 from any of known nodes">>}}}

Now, in each of the destination CS instances (all clusters are of size 1), I see the following error at around the same time as the above error in the source:

[user:info,2013-09-21T14:52:36.205,ns_1@machineName.us-west-2.compute.amazonaws.com:<0.1808.0>:ns_log:crash_consumption_loop:64]Port server moxi on node 'babysitter_of_ns_1@127.0.0.1' exited with status 0. Restarting. Messages:
2013-09-20 14:03:32: (cproxy_config.c.315) env: MOXI_SASL_PLAIN_USR (9)
2013-09-20 14:03:32: (cproxy_config.c.324) env: MOXI_SASL_PLAIN_PWD (9)
2013-09-20 14:03:35: (agent_config.c.703) ERROR: bad JSON configuration from http://127.0.0.1:8091/pools/default/saslBucketsStreaming: Number of vBuckets must be a power of two > 0 and <= 65536
2013-09-20 14:03:48: (agent_config.c.703) ERROR: bad JSON configuration from http://127.0.0.1:8091/pools/default/saslBucketsStreaming: Number of vBuckets must be a power of two > 0 and <= 65536
EOL on stdin. Exiting
[ns_server:info,2013-09-21T14:52:40.996,ns_1@ec2-54-214-254-175.us-west-2.compute.amazonaws.com:<0.8271.0>:mc_connection:run_loop:202]mccouch connection was normally closed
[user:info,2013-09-21T14:52:40.997,ns_1@ec2-54-214-254-175.us-west-2.compute.amazonaws.com:ns_memcached-cust1_app1<0.8258.0>:ns_memcached:terminate:738]Control connection to memcached on 'ns_1@ec2-54-214-254-175.us-west-2.compute.amazonaws.com' disconnected: {{badmatch,{error,closed}},
  [{mc_client_binary,cmd_binary_vocal_recv,5},
   {mc_client_binary,select_bucket,2},
   {ns_memcached,ensure_bucket,2},
   {ns_memcached,handle_info,2},
   {gen_server,handle_msg,5},
   {ns_memcached,init,1},
   {gen_server,init_it,6},
   {proc_lib,init_p_do_apply,3}]}
[ns_server:info,2013-09-21T14:52:40.996,ns_1@ec2-54-214-254-175.us-west-2.compute.amazonaws.com:<0.2008.0>:mc_connection:run_loop:202]mccouch connection was normally closed
[error_logger:error,2013-09-21T14:52:40.997,ns_1@ec2-54-214-254-175.us-west-2.compute.amazonaws.com:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: erlang:apply/2
pid: <0.2012.0>
registered_name: []
exception error: no match of right hand side value {error,closed}
in function mc_binary:quick_stats_recv/3
in call from mc_binary:quick_stats_loop/5
in call from mc_binary:quick_stats/5
in call from ns_memcached:do_handle_call/3
in call from ns_memcached:worker_loop/3
ancestors: ['ns_memcached-default','single_bucket_sup-default',<0.1984.0>]
messages: []
links: [<0.1998.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 4181
stack_size: 24
reductions: 280781421
neighbours:

As stated above, I saw this problem in 2.1 and now in 2.2, and when I replicate from 1 to 3 clusters, it happens about 50% of the time.

To clarify one thing: I saw this behavior in 2.1 as well. By behavior, I mean that after some period of time the replication slows down, and the pattern until completion is: some data is replicated, then the console on the destination shows no operations for about 30 seconds, then the console shows some more data come in, then nothing again for 30 seconds, and this continues until eventually all documents have been replicated.

[{optimistic_replication_threshold,256},
{worker_batch_size,500},
{failure_restart_interval,30},
{doc_batch_size_kb,2048},
{checkpoint_interval,1800},
{max_concurrent_reps,32},

Looks like you are hitting the failure restart interval:
{failure_restart_interval,30},

In 2.2 there is a new replication method that goes from memcached to memcached instead of disk to disk between the two clusters. It’s called “Version 2” in the XDCR options.
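If you want to script that choice rather than pick it in the UI, something along these lines may work against the source cluster. This is only a sketch: /controller/createReplication is the 2.x endpoint for creating a replication, but the type=xmem parameter name, the remote cluster name, and the credentials below are assumptions, so check the 2.2 REST documentation (the same choice shows up as the “XDCR Protocol: Version 2” option in the UI).

import requests

SOURCE = "http://ec2-107-22-165-101.compute-1.amazonaws.com:8091"  # source cluster admin port (host taken from the logs above)
AUTH = ("Administrator", "password")  # placeholder credentials

# Create a one-way, continuous replication using the memcached-to-memcached
# ("Version 2" / xmem) protocol. "type" is an assumed parameter name here;
# "capi" would select the older disk-to-disk protocol.
resp = requests.post(
    SOURCE + "/controller/createReplication",
    auth=AUTH,
    data={
        "fromBucket": "cust1_app1",
        "toCluster": "remote-cluster-1",   # name of your remote cluster reference (hypothetical)
        "toBucket": "cust1_app1",
        "replicationType": "continuous",
        "type": "xmem",
    },
)
print(resp.status_code, resp.text)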

You might be having CPU usage issues or bandwidth issues.
{max_concurrent_reps,32} = the number of vBuckets at the source that are sending data to the target at any given time.

When you replicate 3x you are really doing 32 x 3. Try lowering it to 10 (see the sketch below).
Also install iftop to track how much data is going out.

To check CPU usage during I like using htop.

If it is bandwidth, you can solve it by adding more nodes, because the 32 x 3 vBuckets sending data are then spread out across more machines.

Also, if you are doing Master->Slave replication, raise
{optimistic_replication_threshold,256}
to around 20 MB (the threshold is in bytes, so 20971520).
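For reference, here is a minimal sketch of making both changes (lower max_concurrent_reps, raise optimistic_replication_threshold) over the REST API instead of through the UI. It assumes the /settings/replications endpoint and the xdcrMaxConcurrentReps / xdcrOptimisticReplicationThreshold parameter names from the 2.2 docs; host and credentials are placeholders. Changing the global settings affects all replications from that cluster; per-replication overrides via /settings/replications/<replication_id> may also be possible.

import requests

SOURCE = "http://ec2-107-22-165-101.compute-1.amazonaws.com:8091"  # source cluster admin port (host from the logs above)
AUTH = ("Administrator", "password")  # placeholder credentials

# Lower the number of vBuckets replicating concurrently per replication and
# raise the optimistic replication threshold to roughly 20 MB (value in bytes).
# Parameter names are the assumed 2.2 REST equivalents of the Erlang settings
# max_concurrent_reps and optimistic_replication_threshold shown above.
resp = requests.post(
    SOURCE + "/settings/replications",
    auth=AUTH,
    data={
        "xdcrMaxConcurrentReps": 10,
        "xdcrOptimisticReplicationThreshold": 20971520,
    },
)
print(resp.status_code, resp.text)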

Reducing max_concurrent_reps helped. I’ll also try the other ideas today. One question though. Once CS gets into this state, is there a way to get it out of that state?

What I think you are asking is: “if and when I reach a point like before where XDCR was not responsive, can I ‘kick’ it somehow to get it back on track?” … Not really. You will have to delete the XDCR connection and go into the xdcr.1-[somenumber] logs and debug it. The logs will pretty much tell you that source XYZ server could not connect to bucket ABC in the other cluster for [somereason].

This is what I’ve noticed. Remember, my XDCR testing is to add a few million documents to a master CS db, wait till it replicates all those docs to all the slaves, then delete all those docs from the master, again wait till all those replications complete, and then start over.

Based on the max-concurrent-reps value, I may get into the state I describe above. I then modify the settings for each of the buckets (I reduce the max-concurrent-reps value). I notice that the current replication continues limping along (let’s assume I was in the inserting part of the XDCR test). After all the docs are finally replicated, I then run my test to delete the docs from the master, and this runs fine to completion. I never see the issue above, and this happened without deleting the XDCR connection.


One other thing. Let’s say I start to get errors during a replication and notice the behavior above on all the destination clusters: add a few thousand docs, do nothing for 30 seconds, add a few thousand docs, do nothing for 30 seconds. If I lower the max-concurrent-reps value for each XDCR replication, I notice that I get no new errors (if I click on the ‘last 10 errors’ link, the errors have NOT changed), but the limping continues until the ‘current’ replication completes.