Rebalance exited with reason {{badmatch,failed} - rebalancer

Hi,

I have a cluster of a few Couchbase servers. One of the servers had reported a time drift. The time drift was fixed, but after a few hours the server became unresponsive (a quick way to re-check the clocks is sketched after the trace below). After a service restart, when we try to rebalance we get this error:

Rebalance exited with reason {{badmatch,failed},
[{ns_rebalancer,rebalance_body,7,
[{file,"src/ns_rebalancer.erl"},
{line,500}]},
{async,'-async_init/4-fun-1-',3,
[{file,"src/async.erl"},{line,199}]}]}.
Rebalance Operation Id = 9141d98b8c73600ffbe28c80cb4e20a4
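
Since this all started with a time drift, here is a minimal sketch to sanity-check the clock offset on each node. The pool.ntp.org server and the rough one-way offset calculation are assumptions on my part, not anything Couchbase-specific:

import socket
import struct
import time

# Minimal SNTP query: compares the local clock against an NTP server.
# Assumption: pool.ntp.org is reachable on UDP port 123 from the node.
NTP_SERVER = "pool.ntp.org"
NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01

def clock_offset(server=NTP_SERVER, timeout=5):
    packet = b"\x1b" + 47 * b"\0"          # LI=0, VN=3, Mode=3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(packet, (server, 123))
        data, _ = s.recvfrom(48)
    local_time = time.time()
    secs, frac = struct.unpack("!II", data[40:48])  # transmit timestamp
    server_time = secs - NTP_EPOCH_OFFSET + frac / 2**32
    return server_time - local_time        # rough one-way offset in seconds

if __name__ == "__main__":
    print("approximate clock offset: %.3f s" % clock_offset())

An offset of more than a second or two on any node would suggest the drift is still there.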

On that server, in the indexer.log file, we are seeing these messages:

2025-01-29T23:25:27.887+00:00 [Info] DDLServiceMgr checking create token progress
2025-01-29T23:25:56.907+00:00 [Info] RebalanceServiceManager::GetCurrentTopology []
2025-01-29T23:25:56.907+00:00 [Info] GenericServiceManager::GetTaskList: return from rev [0 0 0 0 0 0 0 0] call. err: operation_canceled
2025-01-29T23:25:56.907+00:00 [Info] GenericServiceManager::GetTaskList: called with rev: []
2025-01-29T23:25:56.907+00:00 [Info] GenericServiceManager::GetTaskList: return from rev [] call. taskList: &{Rev:[0 0 0 0 0 0 0 0] Tasks:[]}
2025-01-29T23:25:56.907+00:00 [Info] RebalanceServiceManager::GetCurrentTopology returns &{Rev:[0 0 0 0 0 0 0 0] Nodes:[efa02cc4c03dc7ae56a66083600927bc] IsBalanced:true Messages:[]}
2025-01-29T23:25:56.907+00:00 [Info] GenericServiceManager::GetTaskList: called with rev: [0 0 0 0 0 0 0 0]
2025-01-29T23:25:56.907+00:00 [Info] RebalanceServiceManager::GetCurrentTopology [0 0 0 0 0 0 0 0]
2025-01-29T23:25:56.987+00:00 [Info] RebalanceServiceManager::rebalanceJanitor Running Periodic Cleanup
2025-01-29T23:26:26.097+00:00 [Info] fragAutoTuner: FragRatio at 100. MaxFragRatio 100, MaxBandwidth 0. BandwidthUsage 0. AvailDisk 0. TotalUsed 0. BandwidthRatio 1. UsedSpaceRatio 1. CleanerBandwidth 9223372036854775807. Duration 0.
2025-01-29T23:26:26.908+00:00 [Info] GenericServiceManager::GetTaskList: return from rev [0 0 0 0 0 0 0 0] call. err: operation_canceled
2025-01-29T23:26:26.908+00:00 [Info] GenericServiceManager::GetTaskList: called with rev: []
2025-01-29T23:26:26.908+00:00 [Info] GenericServiceManager::GetTaskList: return from rev [] call. taskList: &{Rev:[0 0 0 0 0 0 0 0] Tasks:[]}
2025-01-29T23:26:26.908+00:00 [Info] RebalanceServiceManager::GetCurrentTopology []
2025-01-29T23:26:26.908+00:00 [Info] RebalanceServiceManager::GetCurrentTopology returns &{Rev:[0 0 0 0 0 0 0 0] Nodes:[efa02cc4c03dc7ae56a66083600927bc] IsBalanced:true Messages:[]}

What we see from these logs is that this server doesn’t see the entire cluster topology.

The server has its firewall stopped, and in the UI the server shows as green.
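
To double-check what each node thinks the cluster looks like, a small sketch using the requests library that queries the cluster REST API on every node (the hostnames and credentials below are placeholders):

import requests

# Placeholders - replace with the actual node addresses and admin credentials.
NODES = ["http://node1:8091", "http://node2:8091", "http://node3:8091"]
AUTH = ("Administrator", "password")

for base in NODES:
    # /pools/default returns, among other things, the list of nodes this
    # node believes are in the cluster, with their health and membership.
    info = requests.get(base + "/pools/default", auth=AUTH, timeout=10).json()
    print(base)
    for n in info["nodes"]:
        print("  %-30s status=%s membership=%s" %
              (n.get("hostname"), n.get("status"), n.get("clusterMembership")))

If the problem node lists only itself while the others list the full cluster, that would match what the indexer.log topology lines suggest.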

I also tried restarting all the Couchbase services on the rest of the nodes and rebooted all the servers. The error persists.

What version?
Do you have analytics? If so, be sure to kill the analytics processes.

The server is 7.6.3 and we don't run any Analytics.

I also tried removing the node from the cluster and adding it back; when I try to run the rebalance I get the same error:

Rebalance exited with reason
{{badmatch,failed},
[{ns_rebalancer,rebalance_body,7,
[{file,"src/ns_rebalancer.erl"}, {line,500}]},
{async,'-async_init/4-fun-1-',3,
[{file,"src/async.erl"},{line,199}]}]}.

In the indexer.log I don't see any new messages being added anymore.
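
In case it helps to reproduce this outside the UI: a rebalance can also be started through the REST API. This is only a sketch - the host and credentials are placeholders, and the otpNode names have to come from /pools/default:

import time
import requests

BASE = "http://node1:8091"                 # placeholder cluster address
AUTH = ("Administrator", "password")       # placeholder credentials

# The otpNode names (e.g. "ns_1@10.0.0.1") come from /pools/default.
nodes = requests.get(BASE + "/pools/default", auth=AUTH, timeout=10).json()["nodes"]
known = ",".join(n["otpNode"] for n in nodes)

# Start a rebalance that keeps every known node and ejects none.
requests.post(BASE + "/controller/rebalance", auth=AUTH,
              data={"knownNodes": known, "ejectedNodes": ""}, timeout=10)

# Poll progress until ns_server reports it is no longer running.
while True:
    progress = requests.get(BASE + "/pools/default/rebalanceProgress",
                            auth=AUTH, timeout=10).json()
    print(progress)
    if progress.get("status") != "running":
        break
    time.sleep(5)

It fails the same way whether triggered from the UI or like this.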

Hi @flaviu - do you have a support contract? I think this needs a cbcollect_info bundle for troubleshooting.

  • Mike

We don't have a support contract. This setup is a test setup for a potential customer, but the customer is very worried that Couchbase is not stable and can't recover by itself, despite the fact that I told them the opposite. And actually this is the first time in 10 years that a cluster has not recovered. @mreiche can you help debug this?

I’ll figure out what we can do.

Follow the instructions to upload your logs.
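
If the UI collection keeps failing, log collection can also be kicked off via the REST API. A rough sketch - the endpoint and parameter names are as I recall them from the docs, so please double-check, and the upload host and customer values are placeholders:

import requests

BASE = "http://node1:8091"              # any node in the cluster (placeholder)
AUTH = ("Administrator", "password")    # placeholder credentials

# Start a cluster-wide log collection; "*" means all nodes.
# uploadHost/customer are only needed for automatic upload - placeholders here.
resp = requests.post(BASE + "/controller/startLogsCollection", auth=AUTH,
                     data={"nodes": "*",
                           "uploadHost": "uploads.couchbase.com",
                           "customer": "your-company-name"},
                     timeout=10)
print(resp.status_code, resp.text)

# Progress shows up in the cluster task list
# ("clusterLogsCollection" is the task type as I recall it).
tasks = requests.get(BASE + "/pools/default/tasks", auth=AUTH, timeout=10).json()
print([t for t in tasks if t.get("type") == "clusterLogsCollection"])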

I am trying to collect the logs (I think some were already uploaded; for the other nodes I am getting this error):

RuntimeError: File size too large, try using force_zip64
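
That RuntimeError looks like it comes from Python's zipfile module (cbcollect_info is a Python tool): the writer hit a file bigger than the classic ~4 GB ZIP limit without ZIP64 enabled for that entry. Just to illustrate what the message means (not a fix for cbcollect_info itself):

import zipfile

def chunks():
    # Hypothetical stand-in for streaming a large log file.
    for _ in range(3):
        yield b"x" * 1024

# Without force_zip64, zipfile raises
#   RuntimeError: File size too large, try using force_zip64
# as soon as a streamed entry crosses the classic ZIP size limit.
# Forcing ZIP64 for the entry (and allowing it on the archive) avoids that.
with zipfile.ZipFile("bundle.zip", "w", allowZip64=True) as zf:
    with zf.open("huge.log", mode="w", force_zip64=True) as out:
        for chunk in chunks():
            out.write(chunk)

In practice this presumably means one of the log files on that node is huge; trimming or rotating the old logs before collecting, or collecting node by node, would keep the bundle under the limit.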

I am seeing these entries in the indexer.log file on one of the servers:

2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 994
2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 366
2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 528
2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 996
2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 833
2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 943
2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 967
2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 541
2024-12-01T15:06:27.465+00:00 [Info] StreamState::connection error - set repair state to RESTART_VB for MAINT_STREAM keyspaceId bucket vb 464
2024-12-01T15:06:27.465+00:00 [Info] Timekeeper::handleStreamConnErr Stream repair is already in progress for stream: MAINT_STREAM, keyspaceId: bucket

@mreiche any suggestions on what I should look for?

Those messages are from December.

You could move the logs that are there aside, then start the cluster, perform whatever operation demonstrates the problem, and then upload the logs. And let us know where they were uploaded to.