Rebalance fails - "no dcp backfill stats"

Afternoon,

We have just attempted to add a 6th node to our existing Couchbase cluster:

Cluster
5 server nodes
5 buckets (single replica)
~100 ops/sec on 3 of these buckets
Version: 3.0.1 Community Edition (build-1444)

with mixed results. On the first attempt (at 00:05) the logs show both the ‘default’ and ‘cache’ buckets being loaded, but the rebalance then failed 2 seconds later:

  • Bucket "cache" loaded on node 'ns_1@node6.mydomain.com' in 0 seconds.
  • Bucket "default" loaded on node 'ns_1@node6.mydomain.com' in 0 seconds.
  • Bucket "default" rebalance does not seem to be swap rebalance

Followed by:

<0.15466.6837> exited with {unexpected_exit,
{'EXIT',<0.26009.6833>,
{dcp_wait_for_data_move_failed,"default",
215,
'ns_1@node1.mydomain.com',
['ns_1@node3.mydomain.com',
'ns_1@node6.mydomain.com'],
{error,no_stats_for_this_vbucket}}}}

Rebalance exited with reason {unexpected_exit,
{'EXIT',<0.26009.6833>,
{dcp_wait_for_data_move_failed,"default",215,
'ns_1@node1.mydomain.com',
['ns_1@node3.mydomain.com',
'ns_1@node6.mydomain.com'],
{error,no_stats_for_this_vbucket}}}}

Further investigation into the error.log on node1 shows further information:

[ns_server:error,2017-01-26T0:05:52.296,ns_1@node1.mydomain.com:<0.3538.6839>:dcp_replicator:wait_for_data_move_loop:134]No dcp backfill stats for bucket "default", partition 215, connection "replication:ns_1@node1.mydomain.com->ns_1@node3.mydomain.com:default"

[ns_server:error,2017-01-26T0:05:52.299,ns_1@node1.mydomain.com:<0.15466.6837>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.26009.6833>,
[ns_server:error,2017-01-26T0:05:52.300,ns_1@node1.mydomain.com:<0.15466.6837>:misc:sync_shutdown_many_i_am_trapping_exits:1434]Shutdown of the following failed: [{<0.26009.6833>,
[ns_server:error,2017-01-26T0:05:52.300,ns_1@node1.mydomain.com:<0.15466.6837>:misc:try_with_maybe_ignorant_after:1470]Eating exception from ignorant after-block:
[rebalance:error,2017-01-26T0:05:52.431,ns_1@node1.mydomain.com:<0.22374.6838>:ns_vbucket_mover:handle_info:203]<0.15466.6837> exited with {unexpected_exit,
[ns_server:error,2017-01-26T0:05:52.434,ns_1@node1.mydomain.com:<0.1489.6816>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.22374.6838>,
[ns_server:error,2017-01-26T0:05:52.435,ns_1@node1.mydomain.com:<0.4179.6826>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.22374.6838>,
[ns_server:error,2017-01-26T0:05:52.436,ns_1@node1.mydomain.com:<0.25831.6827>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.22374.6838>,
[ns_server:error,2017-01-26T0:05:52.436,ns_1@node1.mydomain.com:<0.27215.6838>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.22374.6838>,
[ns_server:error,2017-01-26T0:05:52.436,ns_1@node1.mydomain.com:<0.21901.6834>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.22374.6838>,
[ns_server:error,2017-01-26T0:05:52.436,ns_1@node1.mydomain.com:<0.16343.6835>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.22374.6838>,
[ns_server:error,2017-01-26T0:05:52.435,ns_1@node1.mydomain.com:<0.27790.6835>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.22374.6838>,
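For anyone investigating the same symptom, one thing worth checking before retrying is the state of the stuck vbucket itself, via cbstats on the source node. A minimal sketch, assuming the stock Linux install path and the node/bucket/partition names from the log above (substitute your own):

```shell
#!/bin/sh
# Hedged sketch - host, bucket and vbucket id are taken from the log excerpt
# above; adjust all of them for your own cluster.
CBSTATS=/opt/couchbase/bin/cbstats
NODE=node1.mydomain.com:11210
BUCKET=default
VB=215

if [ -x "$CBSTATS" ]; then
    # State of the partition the mover is stuck on (active/replica, item count):
    "$CBSTATS" "$NODE" -b "$BUCKET" vbucket-details "$VB"
    # Stream-level DCP counters for that vbucket; counters that never appear
    # (or never advance) would match the "no backfill stats" symptom:
    "$CBSTATS" "$NODE" -b "$BUCKET" dcp | grep "stream_${VB}_"
else
    echo "cbstats not found at $CBSTATS - adjust the path for your install" >&2
fi
```

If `vbucket-details 215` disagrees between node1 and node3 (e.g. no replica state on the destination), that would narrow down where the move is breaking.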

We attempted the rebalance another 5 times (compacting the default bucket between attempts), and each attempt generated the same error as above, with the same partition (215). However, on the last attempt the following was recorded in the UI:

Updated bucket default (of type membase) properties:
[{num_replicas,1},
{ram_quota,8912896000},
{auth_type,sasl},
{autocompaction,false},
{purge_interval,undefined},
{flush_enabled,true},
{num_threads,3},
{eviction_policy,value_only}]

At which point the UI reported that the ‘default’ bucket (and one other) is now served by all 6 nodes. However, our other 3 buckets are still running on only 5 nodes.

Any suggestions as to the cause of the “No dcp backfill stats for bucket” error? The UI states a rebalance is still required - should we attempt another one to transfer the data for the other 3 buckets?

You can probably just retry the rebalance - it’ll pick up from where it left off.
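If it helps, the retry can also be driven from the CLI rather than the UI. A sketch, assuming the default install path and admin credentials (both assumptions - substitute your own):

```shell
#!/bin/sh
# Hedged sketch: retry the rebalance via couchbase-cli. Install path,
# cluster address and credentials are placeholders - substitute yours.
CLI=/opt/couchbase/bin/couchbase-cli
CLUSTER=node1.mydomain.com:8091

if [ -x "$CLI" ]; then
    # Resume moving the remaining vbuckets; already-completed moves
    # are not redone.
    "$CLI" rebalance -c "$CLUSTER" -u Administrator -p password
    # Poll progress while it runs:
    "$CLI" rebalance-status -c "$CLUSTER" -u Administrator -p password
else
    echo "couchbase-cli not found at $CLI - adjust the path for your install" >&2
fi
```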

Note that 3.0.1 is a pretty old release now - even on CE - it was released in Oct 2014. I’d strongly recommend upgrading to at least 3.1.3, and ideally 4.5.0. There have been many rebalance issues fixed since then, along with many other general improvements.

Thanks, I’ll retry the rebalance again, and let you know how it goes.
I’m aware of all the fixes in 4.5.0, and have just recently (last week) pushed this version into our dev environment.

Morning!
I attempted the rebalance Friday evening, and unfortunately I hit the same problem. It started (resuming at 20%), then failed within a couple of seconds.

It was generating very similar errors as before:

Rebalance exited with reason {unexpected_exit,
{'EXIT',<0.18175.7072>,
{dcp_wait_for_data_move_failed,"default",559,
'ns_1@client11.edigitalresearch.com',
['ns_1@client12.edigitalresearch.com',
'ns_1@db22a.edigitalresearch.com'],
{error,no_stats_for_this_vbucket}}}}

with the same sort of errors in the error.log:

[ns_server:error,2017-01-27T23:23:00.081,ns_1@node1.mydomain.com:<0.25811.7069>:dcp_replicator:wait_for_data_move_loop:134]No dcp backfill stats for bucket "default", partition 215, connection "replication:ns_1@node1.mydomain.com->ns_1@node3.mydomain.com:default"
[ns_server:error,2017-01-27T23:23:00.083,ns_1@node1.mydomain.com:<0.19462.7072>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.14430.7073>,
{dcp_wait_for_data_move_failed,"default",215,
'ns_1@node1.mydomain.com',
['ns_1@node3.mydomain.com',
'ns_1@db22a.mydomain.com'],
{error,no_stats_for_this_vbucket}}}
[ns_server:error,2017-01-27T23:23:00.083,ns_1@node1.mydomain.com:<0.19462.7072>:misc:sync_shutdown_many_i_am_trapping_exits:1434]Shutdown of the following failed: [{<0.14430.7073>,
{dcp_wait_for_data_move_failed,"default",
215,
'ns_1@node1.mydomain.com',
['ns_1@node3.mydomain.com',
'ns_1@db22a.mydomain.com'],
{error,no_stats_for_this_vbucket}}}]

At this point, we’re unable to complete the rebalance across the remaining buckets, so an in-place upgrade of this environment is unlikely to work. I’ve actually just come across a very similar error in the Couchbase issue tracker:

https://issues.couchbase.com/browse/MB-22082

So any advice would be appreciated. In the meantime, I’m building a new cluster running 4.5.

Is there any more information on this? I am on CE 4.5.1 and am adding new nodes on CE 5.0.1. While running the rebalance I keep getting 'error,no_stats_for_this_vbucket'. I’ve tried multiple times without success; any suggestions?

TIA
Mark