When we try to set up XDCR replication to our remote data center, one of our production nodes always fails 3-6 hours in. When this occurs, Couchbase initiates a failover. CPU, memory, and disk resources all appear adequate. We could really use some help understanding why this happens. Looking at the logs, we see the following (a rough sketch of how we create the replication is included after the log excerpt):
- Replication from bucket "default" to bucket "default" on cluster "Couchbase DR" created.
- (6 hours later) Could not auto-failover node ('ns_1@10.0.100.103'). There was at least another node down.
- Starting failing over 'ns_1@10.0.100.103'
- Shutting down bucket "default" on 'ns_1@10.0.100.103' for deletion
- Failed over 'ns_1@10.0.100.103': ok
- Then the failover data:
Node ('ns_1@10.0.100.103') was automatically failovered.
[{last_heard,{1486,461532,134206}},
{outgoing_replications_safeness_level,[]},
{incoming_replications_conf_hashes,[]},
{active_buckets,[]},
{ready_buckets,[]},
{local_tasks,[[{type,xdcr},
{id,<<"c31a321ef6d1a091c15bfece55113287/default/default">>},
{errors,[]},
{changes_left,21607089},
{docs_checked,6000},
{docs_written,6000},
{data_replicated,1162225},
{active_vbreps,32},
{waiting_vbreps,0},
{time_working,399},
{time_committing,0},
{num_checkpoints,0},
{num_failedckpts,0},
{docs_rep_queue,24421},
{size_rep_queue,2791713}]]},
{memory,[{total,294963176},
{processes,174874312},
{processes_used,171989272},
{system,120088864},
{atom,1502673},
{atom_used,1495115},
{binary,28432056},
{code,15041863},
{ets,64090848}]},
{system_memory_data,[{system_total_memory,20962406400},
{free_swap,4294963200},
{total_swap,4294963200},
{cached_memory,1529925632},
{buffered_memory,145485824},
{free_memory,3445846016},
{total_memory,20962406400}]},
{node_storage_conf,[{db_path,"/opt/couchbase/var/lib/couchbase/data"},
{index_path,"/opt/couchbase/var/lib/couchbase/data"}]},
{statistics,[{wall_clock,{52925934918,5000}},
{context_switches,{163817818663,0}},
{garbage_collection,{13267299552,91799678196518,0}},
{io,{{input,4786443488287},{output,13158370499012}}},
{reductions,{30658253554678,145473}},
{run_queue,0},
{runtime,{5026268150,80}}]},
{system_stats,[{cpu_utilization_rate,0.5},
{swap_total,4294963200},
{swap_used,0}]},
{interesting_stats,[]},
{cluster_compatibility_version,131072},
{version,[{public_key,"0.13"},
{lhttpc,"1.3.0"},
{ale,"8cffe61"},
{os_mon,"2.2.7"},
{couch_set_view,"1.2.0a-8352437-git"},
{mnesia,"4.5"},
{inets,"5.7.1"},
{couch,"1.2.0a-8352437-git"},
{mapreduce,"1.0.0"},
{couch_index_merger,"1.2.0a-8352437-git"},
{kernel,"2.14.5"},
{crypto,"2.0.4"},
{ssl,"4.1.6"},
{sasl,"2.1.10"},
{couch_view_parser,"1.0.0"},
{ns_server,"2.0.1-170-rel-community"},
{mochiweb,"1.4.1"},
{oauth,"7d85d3ef"},
{stdlib,"1.17.5"}]},
{supported_compat_version,[2,0]},
{system_arch,"x86_64-unknown-linux-gnu"},
{wall_clock,52925934},
{memory_data,{20962406400,20629684224,{<15470.7605.0>,13313704}}},
{disk_data,[{"/",21545540,16},
{"/dev/shm",10235548,0},
{"/boot",99150,60},
{"/opt",412710064,24}]},
{meminfo,<<"MemTotal: 20471100 kB\nMemFree: 3365084 kB\nBuffers: 142076 kB\nCached: 1494068 kB\nSwapCached: 0 kB\nActive: 13488072 kB\nInactive: 2527744 kB\nActive(anon): 12773140 kB\nInactive(anon): 1647564 kB\nActive(file): 714932 kB\nInactive(file): 880180 kB\nUnevictable: 767544 kB\nMlocked: 0 kB\nSwapTotal: 4194300 kB\nSwapFree: 4194300 kB\nDirty: 280 kB\nWriteback: 0 kB\nAnonPages: 15147212 kB\nMapped: 44608 kB\nShmem: 168 kB\nSlab: 116148 kB\nSReclaimable: 87212 kB\nSUnreclaim: 28936 kB\nKernelStack: 1488 kB\nPageTables: 42460 kB\nNFS_Unstable: 0 kB\nBounce: 0 kB\nWritebackTmp: 0 kB\nCommitLimit: 14429848 kB\nCommitted_AS: 21044432 kB\nVmallocTotal: 34359738367 kB\nVmallocUsed: 197232 kB\nVmallocChunk: 34359537336 kB\nHardwareCorrupted: 0 kB\nAnonHugePages: 1622016 kB\nHugePages_Total: 0\nHugePages_Free: 0\nHugePages_Rsvd: 0\nHugePages_Surp: 0\nHugepagesize: 2048 kB\nDirectMap4k: 10240 kB\nDirectMap2M: 20961280 kB\n">>}]
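For reference, this is roughly how we create the replication, via the standard REST API on a source-cluster node. This is only a sketch: the IPs, hostnames, and credentials below are placeholders, not our real values.

```python
# Sketch of our XDCR setup against the Couchbase REST API (Couchbase Server 2.0.x).
# All hosts and credentials here are placeholders.
import requests

SOURCE = "http://10.0.100.101:8091"      # any node in the source cluster (placeholder)
AUTH = ("Administrator", "password")      # cluster admin credentials (placeholder)

# 1. Register the remote "Couchbase DR" cluster reference on the source cluster.
requests.post(
    SOURCE + "/pools/default/remoteClusters",
    auth=AUTH,
    data={
        "name": "Couchbase DR",
        "hostname": "dr-node.example.com:8091",  # placeholder DR node address
        "username": "Administrator",
        "password": "password",
    },
).raise_for_status()

# 2. Start continuous replication from the local "default" bucket
#    to the remote "default" bucket.
requests.post(
    SOURCE + "/controller/createReplication",
    auth=AUTH,
    data={
        "fromBucket": "default",
        "toCluster": "Couchbase DR",
        "toBucket": "default",
        "replicationType": "continuous",
    },
).raise_for_status()
```

Nothing fancy there; the replication runs fine for several hours before the node in question drops out and the failover above kicks in. Any pointers on what to check next would be appreciated.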