Rebalance exited with reason {{badmatch,failed} - rebalancer

flaviu · January 29, 2025, 11:35pm

Hi,

I have a cluster of few couchbase servers. One of the servers had reported a time drift. The time drift was fixed but after a few hours the server become unresponsive. After a service restart when we are trying to rebalance we are getting this error:

Rebalance exited with reason {{badmatch,failed},
[{ns_rebalancer,rebalance_body,7,
[{file,"src/ns_rebalancer.erl"},
{line,500}]},
{async,'-async_init/4-fun-1-',3,
[{file,"src/async.erl"},{line,199}]}]}.
Rebalance Operation Id = 9141d98b8c73600ffbe28c80cb4e20a4

on the server in the indexers.log file we are seeing these messages:

2025-01-29T23:25:27.887+00:00 [Info] DDLServiceMgr checking create token progress
2025-01-29T23:25:56.907+00:00 [Info] RebalanceServiceManager::GetCurrentTopology []
2025-01-29T23:25:56.907+00:00 [Info] GenericServiceManager::GetTaskList: return from rev [0 0 0 0 0 0 0 0] call. err: operation_canceled
2025-01-29T23:25:56.907+00:00 [Info] GenericServiceManager::GetTaskList: called with rev: []
2025-01-29T23:25:56.907+00:00 [Info] GenericServiceManager::GetTaskList: return from rev [] call. taskList: &{Rev:[0 0 0 0 0 0 0 0] Tasks:[]}
2025-01-29T23:25:56.907+00:00 [Info] RebalanceServiceManager::GetCurrentTopology returns &{Rev:[0 0 0 0 0 0 0 0] Nodes:[efa02cc4c03dc7ae56a66083600927bc] IsBalanced:true Messages:[]}
2025-01-29T23:25:56.907+00:00 [Info] GenericServiceManager::GetTaskList: called with rev: [0 0 0 0 0 0 0 0]
2025-01-29T23:25:56.907+00:00 [Info] RebalanceServiceManager::GetCurrentTopology [0 0 0 0 0 0 0 0]
2025-01-29T23:25:56.987+00:00 [Info] RebalanceServiceManager::rebalanceJanitor Running Periodic Cleanup
2025-01-29T23:26:26.097+00:00 [Info] fragAutoTuner: FragRatio at 100. MaxFragRatio 100, MaxBandwidth 0. BandwidthUsage 0. AvailDisk 0. TotalUsed 0. BandwidthRatio 1. UsedSpaceRatio 1. CleanerBandwidth 9223372036854775807. Duration 0.
2025-01-29T23:26:26.908+00:00 [Info] GenericServiceManager::GetTaskList: return from rev [0 0 0 0 0 0 0 0] call. err: operation_canceled
2025-01-29T23:26:26.908+00:00 [Info] GenericServiceManager::GetTaskList: called with rev: []
2025-01-29T23:26:26.908+00:00 [Info] GenericServiceManager::GetTaskList: return from rev [] call. taskList: &{Rev:[0 0 0 0 0 0 0 0] Tasks:[]}
2025-01-29T23:26:26.908+00:00 [Info] RebalanceServiceManager::GetCurrentTopology []
2025-01-29T23:26:26.908+00:00 [Info] RebalanceServiceManager::GetCurrentTopology returns &{Rev:[0 0 0 0 0 0 0 0] Nodes:[efa02cc4c03dc7ae56a66083600927bc] IsBalanced:true Messages:[]}

What we see from these logs is that this server doesn’t see the entire cluster topology.

The server have the firewall stopped and in the UI I can see the server in green.

I also retried restarting all the couchbase serives on the rest of the nodes and also rebooted all the servers. The error persists/

mreiche · January 30, 2025, 12:53am

What version?
Do you have analytics? If so, be sure to kill the analytics processes.

flaviu · January 30, 2025, 1:44am

The server is 7.6.3 and we don’t run any analytics

Also tried to remove the node from the cluster, and add it back, when I try to run the rebalance I am getting the same error

Rebalance exited with reason 
{{badmatch,failed},
[{ns_rebalancer,rebalance_body,7,
[{file,\"src/ns_rebalancer.erl\"}, {line,500}]},
{async,'-async_init/4-fun-1-',3,
[{file,\"src/async.erl\"},{line,199}]}]}.

in the idexer.log I don’t see any messages added anymore

mreiche · January 30, 2025, 8:02pm

Hi @flaviu - do you have a support contract? I think this needs cbcollect_info bundle for troubleshooting.

Mike

flaviu · January 31, 2025, 10:21am

We don’t have a support contract. This setup is a test setup for a potential customer. but the customer is very worried that Couchbase is not stable and can’t recover by itself dispite the fact that I told them the opposite. And actually this is the first time in 10 years when a cluster is not recovering. @mreiche can you help debug this?

mreiche · February 2, 2025, 4:02pm

I’ll figure out what we can do.

mreiche · February 3, 2025, 6:22pm

Follow the instructions to upload your logs.

flaviu · February 4, 2025, 9:14am

I am trying to collect the logs ( I think some were already uploaded, for other nodes getting this error)

RuntimeError: File size too large, try using force_zip64

flaviu · February 6, 2025, 10:05am

seeing these logs on one of the server indexer.log file

2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 994
2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 366
2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 528
2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 996
2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 833
2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 943
2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 967
2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 541
2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 464
2024-12-01T15:06:27.465+00:00 [Info] Timekeeper::handleStreamConnErr Stream repair is already in progress for stream: MAINT_STREAM, keyspaceId: bucket

flaviu · February 12, 2025, 6:47pm

@mreiche any suggestions for what should I look for?

mreiche · February 12, 2025, 7:10pm

Those messages are from December.

You could move the logs that are there, then start the cluster, do whatever operation demonstrates the problem, then upload the logs. And provide where they were uploaded to.

flaviu · April 7, 2025, 9:22pm

just a quick update. after a Couchbase software update the rebalance worked flawless and everything went back to normal

Topic		Replies	Views
Rebalance failed Couchbase Server	2	781	September 6, 2023
Failure during rebalance due to badmatch? Couchbase Server	3	3579	July 11, 2013
Problems with swap rebalance Couchbase Server	3	2639	November 14, 2013
Error message while rebalancing Couchbase Server	0	1844	September 22, 2014
Re-balance failing with error - Rebalance exited with reason {badmatch,failed} Couchbase Server server	1	8014	February 13, 2018

Rebalance exited with reason {{badmatch,failed} - rebalancer

Related topics