I saw a similar problem when I first tested with the 2.1 release. I'm in the Amazon cloud environment, replicating from one CS 2.2 cluster to several other CS 2.2 clusters using XDCR. When I replicate from one cluster to another, I never see an issue. When I replicate from one cluster to two clusters, I still don't see an issue. However, when I replicate from one cluster to three or more clusters, I fairly consistently see the following behavior:
Replication proceeds fine until, all of a sudden, errors start appearing in the console of the source (my replication is always one way). The replication rate to all destinations then slows down dramatically. In fact, looking at the console of each of the three destinations, there is some activity, then for about 30 seconds there is no activity, then some activity again (which may last 30 seconds or so), then no activity again (for about another 30 seconds). The end result is that all the documents are eventually replicated successfully, but once I get into this state the transfer rate is reduced dramatically and performance suffers.
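For context, each destination is set up the same way from the source cluster: one remote cluster reference plus one continuous one-way replication of the bucket. Roughly, the equivalent REST calls would look like this (a sketch assuming the standard 2.x endpoints; hostnames and credentials are placeholders):

    # Register one remote cluster reference per destination
    curl -u Administrator:password -X POST http://<source-node>:8091/pools/default/remoteClusters \
      -d name=dest-1 \
      -d hostname=<dest-1-node>:8091 \
      -d username=Administrator \
      -d password=password

    # Create a continuous one-way replication of the bucket to that destination
    curl -u Administrator:password -X POST http://<source-node>:8091/controller/createReplication \
      -d fromBucket=cust1_app1 \
      -d toCluster=dest-1 \
      -d toBucket=cust1_app1 \
      -d replicationType=continuous

    # Repeated for dest-2, dest-3, ...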
So, I looked in the log files. What I see in the source log file (xdcr_errors.#) is the following:
[xdcr:error,2013-09-21T13:37:43.853,ns_1@machineName.compute-1.amazonaws.com:<0.27099.25>:xdc_vbucket_rep:terminate:398]Replication (XMem mode) ba6ef7badea24f4fcd2e1b9e938b7b99/cust1_app1/cust1_app1
(cust1_app1/250
-> http://*****@ec2-54-246-81-99.eu-west-1.compute.amazonaws.com:8092/cust1_app1%2f250%3b7857d92c94752330262c7567bcf17792
) failed. Please see ns_server debug log for complete state dump
[xdcr:error,2013-09-21T13:37:43.858,ns_1@ec2-107-22-165-101.compute-1.amazonaws.com:<0.13879.208>:xdc_vbucket_rep:handle_info:90]Error initializing vb replicator ({init_state,
  {rep,
    <<"ba6ef7badea24f4fcd2e1b9e938b7b99/cust1_app1/cust1_app1">>,
    <<"cust1_app1">>,
    <<"/remoteClusters/ba6ef7badea24f4fcd2e1b9e938b7b99/buckets/cust1_app1">>,
    "xmem",
    [{optimistic_replication_threshold,256},
     {worker_batch_size,500},
     {failure_restart_interval,30},
     {doc_batch_size_kb,2048},
     {checkpoint_interval,1800},
     {max_concurrent_reps,32},
     {connection_timeout,180},
     {worker_processes,4},
     {http_connections,20},
     {retries_per_request,2},
     {xmem_worker,1},
     {enable_pipeline_ops,true},
     {local_conflict_resolution,false},
     {socket_options,[{keepalive,true},{nodelay,false}]},
     {trace_dump_invprob,1000}]},
  250,"xmem",<0.12755.0>,<0.12756.0>,<0.12751.0>}):
{error,
 {badmatch,
  {error,all_nodes_failed,
   <<"Failed to grab remote bucket cust1_app1 from any of known nodes">>}}}
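Since the source reports all_nodes_failed / "Failed to grab remote bucket cust1_app1 from any of known nodes", the obvious thing to rule out is plain reachability of the destination from the source node. A rough sketch of that check (the hostname and credentials are placeholders for my setup):

    # From the source node: can we reach the destination's REST port and see the bucket?
    curl -u Administrator:password \
      http://ec2-54-246-81-99.eu-west-1.compute.amazonaws.com:8091/pools/default/buckets/cust1_app1

    # And the XDCR/CAPI port (8092) that the replication URL above points at?
    curl http://ec2-54-246-81-99.eu-west-1.compute.amazonaws.com:8092/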
Now, on each of the destination CS instances (all clusters are of size 1), I see the following errors at around the same time as the error above on the source:
[user:info,2013-09-21T14:52:36.205,ns_1@machineName.us-west-2.compute.amazonaws.com:<0.1808.0>:ns_log:crash_consumption_loop:64]Port server moxi on node 'babysitter_of_ns_1@127.0.0.1' exited with status 0. Restarting. Messages:
2013-09-20 14:03:32: (cproxy_config.c.315) env: MOXI_SASL_PLAIN_USR (9)
2013-09-20 14:03:32: (cproxy_config.c.324) env: MOXI_SASL_PLAIN_PWD (9)
2013-09-20 14:03:35: (agent_config.c.703) ERROR: bad JSON configuration from http://127.0.0.1:8091/pools/default/saslBucketsStreaming: Number of vBuckets must be a power of two > 0 and <= 65536
2013-09-20 14:03:48: (agent_config.c.703) ERROR: bad JSON configuration from http://127.0.0.1:8091/pools/default/saslBucketsStreaming: Number of vBuckets must be a power of two > 0 and <= 65536
EOL on stdin. Exiting
[ns_server:info,2013-09-21T14:52:40.996,ns_1@ec2-54-214-254-175.us-west-2.compute.amazonaws.com:<0.8271.0>:mc_connection:run_loop:202]mccouch connection was normally closed
[user:info,2013-09-21T14:52:40.997,ns_1@ec2-54-214-254-175.us-west-2.compute.amazonaws.com:ns_memcached-cust1_app1<0.8258.0>:ns_memcached:terminate:738]Control connection to memcached on 'ns_1@ec2-54-214-254-175.us-west-2.compute.amazonaws.com' disconnected: {{badmatch,{error,closed}},
  [{mc_client_binary,cmd_binary_vocal_recv,5},
   {mc_client_binary,select_bucket,2},
   {ns_memcached,ensure_bucket,2},
   {ns_memcached,handle_info,2},
   {gen_server,handle_msg,5},
   {ns_memcached,init,1},
   {gen_server,init_it,6},
   {proc_lib,init_p_do_apply,3}]}
[ns_server:info,2013-09-21T14:52:40.996,ns_1@ec2-54-214-254-175.us-west-2.compute.amazonaws.com:<0.2008.0>:mc_connection:run_loop:202]mccouch connection was normally closed
[error_logger:error,2013-09-21T14:52:40.997,ns_1@ec2-54-214-254-175.us-west-2.compute.amazonaws.com:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: erlang:apply/2
    pid: <0.2012.0>
    registered_name: []
    exception error: no match of right hand side value {error,closed}
      in function  mc_binary:quick_stats_recv/3
      in call from mc_binary:quick_stats_loop/5
      in call from mc_binary:quick_stats/5
      in call from ns_memcached:do_handle_call/3
      in call from ns_memcached:worker_loop/3
    ancestors: ['ns_memcached-default','single_bucket_sup-default',<0.1984.0>]
    messages: []
    links: [<0.1998.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 4181
    stack_size: 24
    reductions: 280781421
    neighbours:
As stated above, I saw this problem in 2.1 and now in 2.2, and when I replicate from one cluster to three clusters it happens about 50% of the time.
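In case it's relevant, the configuration stream that moxi is complaining about can be inspected directly on a destination node with something like the following (a sketch; the credentials are placeholders, and -N just turns off curl's buffering since the endpoint streams):

    # Dump what moxi sees from the streaming SASL-buckets config endpoint on the destination
    curl -N -u Administrator:password http://127.0.0.1:8091/pools/default/saslBucketsStreaming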