Rebalance fails - "no dcp backfill stats"

Afternoon,

We have just attempted to add a 6th node to our existing Couchbase cluster:

Cluster
5 server nodes
5 buckets (single replica)
~100 ops/sec on 3 of these buckets
Version: 3.0.1 Community Edition (build-1444)

with mixed results. On the first attempt (at 00:05) the logs show both the ‘default’ and ‘cache’ buckets being loaded, but the rebalance then failed 2 seconds later:

  • Bucket "cache" loaded on node 'ns_1@node6.mydomain.com' in 0 seconds.
  • Bucket "default" loaded on node 'ns_1@node6.mydomain.com' in 0 seconds.
  • Bucket "default" rebalance does not seem to be swap rebalance

Followed by:

<0.15466.6837> exited with {unexpected_exit,
{'EXIT',<0.26009.6833>,
{dcp_wait_for_data_move_failed,"default",
215,
'ns_1@node1.mydomain.com',
['ns_1@node3.mydomain.com',
'ns_1@node6.mydomain.com'],
{error,no_stats_for_this_vbucket}}}}

Rebalance exited with reason {unexpected_exit,
{'EXIT',<0.26009.6833>,
{dcp_wait_for_data_move_failed,"default",215,
'ns_1@node1.mydomain.com',
['ns_1@node3.mydomain.com',
'ns_1@node6.mydomain.com'],
{error,no_stats_for_this_vbucket}}}}

Further investigation into the error.log on node1 shows further information:

[ns_server:error,2017-01-26T0:05:52.296,ns_1@node1.mydomain.com:<0.3538.6839>:dcp_replicator:wait_for_data_move_loop:134]No dcp backfill stats for bucket "default", partition 215, connection "replication:ns_1@node1.mydomain.com->ns_1@node3.mydomain.com:default"

[ns_server:error,2017-01-26T0:05:52.299,ns_1@node1.mydomain.com:<0.15466.6837>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.26009.6833>,
[ns_server:error,2017-01-26T0:05:52.300,ns_1@node1.mydomain.com:<0.15466.6837>:misc:sync_shutdown_many_i_am_trapping_exits:1434]Shutdown of the following failed: [{<0.26009.6833>,
[ns_server:error,2017-01-26T0:05:52.300,ns_1@node1.mydomain.com:<0.15466.6837>:misc:try_with_maybe_ignorant_after:1470]Eating exception from ignorant after-block:
[rebalance:error,2017-01-26T0:05:52.431,ns_1@node1.mydomain.com:<0.22374.6838>:ns_vbucket_mover:handle_info:203]<0.15466.6837> exited with {unexpected_exit,
[ns_server:error,2017-01-26T0:05:52.434,ns_1@node1.mydomain.com:<0.1489.6816>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.22374.6838>,
[ns_server:error,2017-01-26T0:05:52.435,ns_1@node1.mydomain.com:<0.4179.6826>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.22374.6838>,
[ns_server:error,2017-01-26T0:05:52.436,ns_1@node1.mydomain.com:<0.25831.6827>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.22374.6838>,
[ns_server:error,2017-01-26T0:05:52.436,ns_1@node1.mydomain.com:<0.27215.6838>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.22374.6838>,
[ns_server:error,2017-01-26T0:05:52.436,ns_1@node1.mydomain.com:<0.21901.6834>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.22374.6838>,
[ns_server:error,2017-01-26T0:05:52.436,ns_1@node1.mydomain.com:<0.16343.6835>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.22374.6838>,
[ns_server:error,2017-01-26T0:05:52.435,ns_1@node1.mydomain.com:<0.27790.6835>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.22374.6838>,
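For anyone investigating the same symptom, one thing worth checking before retrying is the state of the stuck vbucket itself, via cbstats on the source node. A minimal sketch, assuming the stock Linux install path and the node/bucket/partition names from the log above (substitute your own):

```shell
#!/bin/sh
# Hedged sketch - host, bucket and vbucket id are taken from the log excerpt
# above; adjust all of them for your own cluster.
CBSTATS=/opt/couchbase/bin/cbstats
NODE=node1.mydomain.com:11210
BUCKET=default
VB=215

if [ -x "$CBSTATS" ]; then
    # State of the partition the mover is stuck on (active/replica, item count):
    "$CBSTATS" "$NODE" -b "$BUCKET" vbucket-details "$VB"
    # Stream-level DCP counters for that vbucket; counters that never appear
    # (or never advance) would match the "no backfill stats" symptom:
    "$CBSTATS" "$NODE" -b "$BUCKET" dcp | grep "stream_${VB}_"
else
    echo "cbstats not found at $CBSTATS - adjust the path for your install" >&2
fi
```

If `vbucket-details 215` disagrees between node1 and node3 (e.g. no replica state on the destination), that would narrow down where the move is breaking.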

We attempted the rebalance another 5 times (compacting the default bucket between attempts), and each attempt generated the same error as above, with the same partition (215). However, on the last attempt the following was recorded in the UI:

Updated bucket default (of type membase) properties:
[{num_replicas,1},
{ram_quota,8912896000},
{auth_type,sasl},
{autocompaction,false},
{purge_interval,undefined},
{flush_enabled,true},
{num_threads,3},
{eviction_policy,value_only}]

At which point the UI reported that the ‘default’ bucket (and one other) is now served by all 6 nodes. However, our other 3 buckets are still running on only 5 nodes.

Any suggestions as to the cause of the “No dcp backfill stats for bucket” error? The UI states a rebalance is still required - should we attempt another one to transfer the data for the other 3 buckets?

You can probably just retry the rebalance - it’ll pick up from where it left off.
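If it helps, the retry can also be driven from the CLI rather than the UI. A sketch, assuming the default install path and admin credentials (both assumptions - substitute your own):

```shell
#!/bin/sh
# Hedged sketch: retry the rebalance via couchbase-cli. Install path,
# cluster address and credentials are placeholders - substitute yours.
CLI=/opt/couchbase/bin/couchbase-cli
CLUSTER=node1.mydomain.com:8091

if [ -x "$CLI" ]; then
    # Resume moving the remaining vbuckets; already-completed moves
    # are not redone.
    "$CLI" rebalance -c "$CLUSTER" -u Administrator -p password
    # Poll progress while it runs:
    "$CLI" rebalance-status -c "$CLUSTER" -u Administrator -p password
else
    echo "couchbase-cli not found at $CLI - adjust the path for your install" >&2
fi
```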

Note that 3.0.1 is a pretty old release now - even on CE - it was released in Oct 2014. I’d strongly recommend upgrading to at least 3.1.3, and ideally 4.5.0. There have been many rebalance issues fixed since then, along with many other general improvements.

Thanks, I’ll retry the rebalance again, and let you know how it goes.
I’m aware of all the fixes in 4.5.0, and have just recently (last week) pushed this version into our dev environment.

Morning!
I attempted the rebalance Friday evening, and unfortunately I hit the same problem. It started (resuming at 20%), then failed within a couple of seconds.

It was generating very similar errors as before:

Rebalance exited with reason {unexpected_exit,
{'EXIT',<0.18175.7072>,
{dcp_wait_for_data_move_failed,"default",559,
'ns_1@client11.edigitalresearch.com',
['ns_1@client12.edigitalresearch.com',
'ns_1@db22a.edigitalresearch.com'],
{error,no_stats_for_this_vbucket}}}}

with the same sort of errors in the error.log:

[ns_server:error,2017-01-27T23:23:00.081,ns_1@node1.mydomain.com:<0.25811.7069>:dcp_replicator:wait_for_data_move_loop:134]No dcp backfill stats for bucket "default", partition 215, connection "replication:ns_1@node1.mydomain.com->ns_1@node3.mydomain.com:default"
[ns_server:error,2017-01-27T23:23:00.083,ns_1@node1.mydomain.com:<0.19462.7072>:ns_single_vbucket_mover:spawn_and_wait:129]Got unexpected exit signal {'EXIT',<0.14430.7073>,
{dcp_wait_for_data_move_failed,"default",215,
'ns_1@node1.mydomain.com',
['ns_1@node3.mydomain.com',
'ns_1@db22a.mydomain.com'],
{error,no_stats_for_this_vbucket}}}
[ns_server:error,2017-01-27T23:23:00.083,ns_1@node1.mydomain.com:<0.19462.7072>:misc:sync_shutdown_many_i_am_trapping_exits:1434]Shutdown of the following failed: [{<0.14430.7073>,
{dcp_wait_for_data_move_failed,"default",
215,
'ns_1@node1.mydomain.com',
['ns_1@node3.mydomain.com',
'ns_1@db22a.mydomain.com'],
{error,no_stats_for_this_vbucket}}}]

At this point, we’re unable to complete the rebalance across the remaining buckets, so an in-place upgrade of this environment is unlikely to work. I’ve actually just come across a very similar error in the Couchbase issue tracker:

https://issues.couchbase.com/browse/MB-22082

So any advice would be appreciated. In the meantime, I’m building a new cluster running 4.5.

Is there any more information on this? I am on CE 4.5.1 and am adding new nodes on CE 5.0.1. While running the rebalance I keep getting 'error,no_stats_for_this_vbucket'. I’ve tried multiple times without success; any suggestions?

TIA
Mark