Rebalance is stuck

The setup is a 2-node cluster.
There is no error message.
After adding the second node to the cluster and hitting “Rebalance”, the first node is stuck at 11.1% while the other is at 33.3%.
Stopping and resuming the rebalance doesn’t help.
Stopping and restarting the second node doesn’t help either.

There is a message in the Logs: “Bucket “xxx” rebalance appears to be swap rebalance”

What is “swap rebalance”, and is it related?

Thanks
Itay

I just had a similar issue a few days back - in my case a vBucket had become corrupted.
Since node 2 is stuck at 33.3% I assume you have 3 active buckets.
Go to the server nodes screen, open the details, and look at which bucket is causing the stall.
Check after stopping and restarting that it stalls at the same place. (Sometimes you can get a few vBuckets through on each start & stop, until you hit the bad one.)
If you are indeed stuck, then you can try what I had to do:

  1. Use cbbackup to do a full backup
  2. Delete the bucket with the corrupted file
  3. Restore that bucket from backup

After that it should work fine. If you are on version 3.0.2, I would suggest installing 3.0.3 over node 1 first so you get a working cbbackup that doesn’t stall.
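If it helps, this is roughly what those steps looked like for me with the bundled tools. A rough sketch only; the host, credentials, backup path and bucket name are placeholders for your own values:

    # 1. Full backup (add -b <bucket> to limit it to one bucket)
    /opt/couchbase/bin/cbbackup http://127.0.0.1:8091 /backups/cb \
        -u Administrator -p password

    # 2. Delete and recreate the corrupted bucket via the web console
    #    (or couchbase-cli bucket-delete / bucket-create)

    # 3. Restore only that bucket from the backup
    /opt/couchbase/bin/cbrestore /backups/cb http://127.0.0.1:8091 \
        -u Administrator -p password -b mybucket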

Hi @Lennartos,

Thanks for your answer.
Indeed, I have 3 buckets, so I guess one of them is stuck.
After a day the percentage rose to 33.2%.

I still cannot rebalance or add a new node.
I tried removing replicas, hoping it might help, but it didn’t.

I cannot believe that your solution is the best one. Data corruption is really unacceptable.
Deleting a bucket has significant consequences: the first is downtime, and the other is loss of data, regardless of whether a backup exists.

BTW, indexing is also stuck for some views.

@cihangirb , please advise.

Itay

You have exactly the same behavior as I had.
Indexing being stuck is the last nail in the coffin – you won’t be able to use the bucket anymore for anything useful.
Any query will return unfinished data, and “stale = false” will return with errors.

After wasting 3 full days on this a few weeks back, I found out that about 1/3 of the files were inaccessible to Couchbase (no replication or indexing past that point – nothing I did could get Couchbase past the point where the vBucket is corrupted).
However, with the backup/restore path I described earlier I did manage to save all files but one.

What version of Couchbase are you running, btw?
It would be horrible if this issue still persists in 3.0.3.

I found a workaround for stuck indexing. It has worked twice already.

All you need to do is modify any view within the design doc, just by adding a whitespace somewhere in the JSON, saving, and selecting “Show Results” on the full set. Voilà, the view gets indexed successfully.
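If you prefer to do that nudge outside the UI, the same thing can be sketched against the views REST API on port 8092. Treat this as an illustration only: the bucket, design doc and view names are made up, and re-publishing the design doc is just the scripted equivalent of saving it in the console:

    # Fetch the current design doc (hypothetical bucket/design-doc names)
    curl -u Administrator:password \
        http://localhost:8092/mybucket/_design/mydesigndoc > ddoc.json

    # Make a harmless edit (even whitespace) in ddoc.json, then publish it again
    curl -u Administrator:password -X PUT \
        -H "Content-Type: application/json" -d @ddoc.json \
        http://localhost:8092/mybucket/_design/mydesigndoc

    # Equivalent of "Show Results" on the full set: force an index build
    curl -u Administrator:password \
        "http://localhost:8092/mybucket/_design/mydesigndoc/_view/myview?stale=false&limit=1"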

I cannot tell if it always works.
Rebalancing is still not working so I cannot add nodes.

P.S. I’m using the 3.0.1 Community Edition server.

Swap rebalance refers to the case where you add and remove the same number of nodes in a single rebalance. In that case the optimization is that we move only the vBuckets that existed on the removed nodes to the nodes being added, and the other nodes don’t have a change in their vBucket layout.

There are a number of issues fixed in 3.0.3 around the rebalance of views. This looks like MB-12931, which has been fixed in 3.0.3.

@itay, I know you are on Community; any chance you could try this on 3.0.3?
thanks
-cihan

I did the same on 3.0.2, but later on noticed that some views were missing results that I should have had.
Upgrading to 3.0.3 and restoring the bucket from backup did the trick for me. (Just upgrading did not help.)
It might have been enough to once again work around it at that point if the fix was indeed in 3.0.3 like @cihangirb suggests; at least it’s worth a shot.

Thanks guys,
@cihangirb, I never upgraded the nodes. Can you please share a link about best practices for upgrading a running cluster with minimum downtime and overhead?

Sure thing. You can do this completely online:
http://docs.couchbase.com/admin/admin/Install/upgrading.html
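For what it’s worth, the online path in those docs boils down to: remove a node, rebalance, upgrade it, add it back, rebalance again. A rough couchbase-cli sketch of that loop, with placeholder host names and credentials (double-check the flags against your version’s couchbase-cli help):

    # Rebalance node2 out of the cluster
    couchbase-cli rebalance -c node1:8091 -u Administrator -p password \
        --server-remove=node2:8091

    # Upgrade the Couchbase package on node2 while it is out of the cluster,
    # then add it back and rebalance once more
    couchbase-cli server-add -c node1:8091 -u Administrator -p password \
        --server-add=node2:8091 \
        --server-add-username=Administrator --server-add-password=password
    couchbase-cli rebalance -c node1:8091 -u Administrator -p password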

Thanks.
Thinking about it, I fail to see how upgrading the servers can help me :weary:

If I understand correctly, I should remove the second node (the one I fail to add using rebalance), upgrade it, add it back, and then rebalance again.

But I can’t rebalance! This is the problem.
Something is probably corrupted on the first node, and this is why I cannot rebalance.
I cannot even back it up completely as cbbackup also gets stuck in the middle of a specific bucket.

The only (bad) solution I can think of is to take down the only good node (the first one), creating a few hours of maintenance downtime, upgrade it, and pray that when it wakes up all the data will still be there.

But how can I upgrade the last and only node in a cluster? Obviously I cannot remove it, because if I do, there will be no cluster.

@cihangirb, what is going on here?
Is there a way to remove only the bad docs that block rebalance and backup?

@cihangirb, please advise
How can I save my data and scale the cluster?

@cihangirb,
There is no 3.0.3 for Community, only 3.0.1 (per the Couchbase downloads page).

@ingenthr, @cihangirb,

Still stuck with a production cluster that I can neither scale nor back up :disappointed_relieved:

@ingenthr, @cihangirb,

Hi Matt and Cihan,
I understand that you are very busy, but I have a cluster in production that I can neither scale nor back up, and I feel completely vulnerable. In case of another failure, I’ll lose all my data from the last ~2 weeks (and counting).

Please advise quickly on how I can scale (rebalance) or back up (to restore and recover).

I can’t advise any further on a newer CE release (maybe @cihangirb can), but I can perhaps recommend that you at least do filesystem-level backups. If it’s on something like AWS, an S3 snapshot would be good.
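As a sketch of the filesystem route: the path below is the default data directory on a Linux package install, so adjust it if you installed elsewhere, and note that a copy taken while the node is serving traffic may not be perfectly consistent:

    # Default data directory on a Linux install of Couchbase Server
    DATA_DIR=/opt/couchbase/var/lib/couchbase/data

    # Archive it to another disk or host
    tar czf /backups/couchbase-data-$(date +%F).tar.gz "$DATA_DIR"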

Hi Itay, there isn’t immediate relief coming for MB-12931, but I can recommend a not-so-great workaround to get you unstuck: is there any way you can XDCR to another cluster? If you do have a large result set this may take a while. Is this possible?
thanks
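If you do go that route, setting up the XDCR link is just a remote-cluster reference plus a per-bucket replication. A rough couchbase-cli sketch, with cluster name, hosts, credentials and bucket as placeholders:

    # Register the destination cluster as an XDCR target
    couchbase-cli xdcr-setup -c source-node:8091 -u Administrator -p password \
        --create --xdcr-cluster-name=recovery \
        --xdcr-hostname=dest-node:8091 \
        --xdcr-username=Administrator --xdcr-password=password

    # Start replicating the affected bucket to the new cluster
    couchbase-cli xdcr-replicate -c source-node:8091 -u Administrator -p password \
        --create --xdcr-cluster-name=recovery \
        --xdcr-from-bucket=mybucket --xdcr-to-bucket=mybucket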

@ingenthr,
Do you mean backing up the data folders of the remaining node? I thought about it, but if the file structure, or a specific set of files, is corrupted, then the backup itself will be corrupted in the same way.
Can I “paste” the copied data files into a different cluster and have it work?
What do you mean by S3? The data files are saved locally on the machine, not on S3, right?
Can you point me to the exact location of the data files?

@cihangirb,
If I create another cluster and XDCR to it, wouldn’t it get stuck just like rebalancing to a different node does?

Perhaps there are switches in cbbackup that I can use?

As far as I understand now:

  1. cbbackup and rebalance both get stuck in the same location, probably due to corrupted data
  2. There is no tool to detect the corrupted data
  3. There is no tool to remove the corrupted data

Perhaps there is a 3rd-party tool that you can point me to that backs up differently, e.g., one that doesn’t get stuck on corrupted docs (skips them instead) and can then restore the data to a different node?

Hi Itay,
cbbackup is bugged in 3.0.2; it doesn’t get stuck in 3.0.3.
After an offline upgrade to 3.0.3 you should be able to back up successfully.

Have you tried whether you can back up only the single bucket in question? As I read in the bug report later on, that should have worked as well.
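For reference, restricting the backup to one bucket is just the -b flag; the host, path, credentials and bucket name below are placeholders for whichever bucket is stalling:

    # Back up only the problematic bucket
    /opt/couchbase/bin/cbbackup http://127.0.0.1:8091 /backups/one-bucket \
        -u Administrator -p password -b mybucket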

I made a bug report on this, and they are looking into it… this is what they requested:
"Can you grab the log using the collect info option. See the link below for details on how to get it through the admin console. You can upload the logs to the bug report.
http://docs.couchbase.com/admin/admin/UI/ui-cluster-wide-info.html
"
Since my system is running again after deleting the bucket, it’s probably best if you can provide this from your current corrupted setup, so they can check whether it was indeed solved in 3.0.3, or solve it so it never happens again.
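If the web console route is awkward, the same diagnostics can also be gathered per node from the command line with cbcollect_info; the output path is just an example:

    # Run on each node; produces a zip you can attach to the bug report
    /opt/couchbase/bin/cbcollect_info /tmp/node1-collect.zip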

https://issues.couchbase.com/browse/MB-14901

Thanks L,

3.0.3 is EE while I’m using CE 3.0.1 (the latest).
As per your post on 05/19, upgrading to 3.0.3 alone was not enough and you had to restore from a backup. I don’t have a valid backup.

I have 3 buckets: A, B, and C. A & B back up fine (separately). C gets stuck.
I did find out that every time I run cbbackup, the hanging point is a bit later in the process. It started around 70%, jumped to 90%, and then I got 99.5%, 99.6%, 99.7%, 99.8%. I thought that maybe on every backup something gets fixed, so the next run gets further. However, after that I started getting 99.7%, 99.6%, 99.5%, and now 98.4%.

This is a very important bucket. I cannot give up on it. I can give up on specific docs and delete them, but I need to know which ones.