.NET SDK fails to recover after cluster automatic failover

I’ve been trying to create a wrapper for the buckets that opens a new bucket if the current one returns too many timeouts or NodeUnavailable errors, and keeps ref counts for buckets that still have in-flight requests. But I’ve found that even opening a new bucket after the failover happens doesn’t always make the timeouts stop. Also, some requests hang indefinitely, over 15 minutes and counting, even after I stop using that particular IBucket instance. IBucket.Get sometimes just never returns.

@Trevor_Elliott -

Using the code in the repo you provided (thanks!), I haven’t been able to replicate this. There is a bit of a hang initially when the node is shut down, but it keeps going. I still have a few knobs to turn, so something may yet come up.

A couple of things to note:

  • I am using CB 5.5.0 for testing; the other poster indicated that you’re using 4.1.0, which is fairly old and officially unsupported as of April 2018. You may want to consider a newer server version.
  • You’re using the synchronous API; you may want to consider the async API as it’s non-blocking.
  • Are you rebalancing after you fail over? If not, you should be!
  • Your app is configured to make the IBucket reference a singleton, which is good. You don’t explicitly shut it down via Dispose, however; you probably want to do that. This is usually done via Application_End in Global.asax or by registering a shutdown callback in Startup.cs when the app terminates (see the sketch after this list).
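
A minimal sketch of that pattern, assuming ASP.NET Core and SDK 2.x; the node URI, bucket name, and the use of ClusterHelper are illustrative rather than taken from your app, and authentication is omitted for brevity:

```csharp
using System;
using System.Collections.Generic;
using Couchbase;
using Couchbase.Configuration.Client;
using Couchbase.Core;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Bootstrap the cluster once and register a single long-lived IBucket.
        ClusterHelper.Initialize(new ClientConfiguration
        {
            Servers = new List<Uri> { new Uri("http://cb-node1:8091") } // placeholder node
        });
        services.AddSingleton<IBucket>(ClusterHelper.GetBucket("default")); // placeholder bucket
    }

    public void Configure(IApplicationBuilder app, IApplicationLifetime lifetime)
    {
        // Dispose the bucket/cluster when the host stops; this plays the same role
        // as calling Dispose from Application_End in Global.asax.
        lifetime.ApplicationStopped.Register(ClusterHelper.Close);
    }
}
```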

I’ll dig a bit deeper and let you know what I find.

Jeff

@jmorris

For my tests I set up a new cluster of 3 couchbase servers running Couchbase Server Community Edition 6.0.0 build 1693.

I can try using the async overloads to see if the bug only happens with the synchronous ones; however, being async doesn’t help if requests still hang indefinitely. We want to avoid a production service outage, which means calls against a failed node need to fail fast regardless of thread usage.
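
For illustration, this is roughly the kind of client-side guard I mean (a hedged sketch, not my actual wrapper; GetWithDeadlineAsync is a made-up helper name): stop awaiting GetAsync after a hard deadline so callers can fall back even if the SDK itself never returns.

```csharp
using System;
using System.Threading.Tasks;
using Couchbase;
using Couchbase.Core;

public static class FailFastExtensions
{
    public static async Task<IOperationResult<T>> GetWithDeadlineAsync<T>(
        this IBucket bucket, string key, TimeSpan deadline)
    {
        var getTask = bucket.GetAsync<T>(key);
        var finished = await Task.WhenAny(getTask, Task.Delay(deadline));
        if (finished != getTask)
        {
            // The underlying operation may still be running inside the SDK;
            // we just stop waiting for it so the caller can fail fast.
            throw new TimeoutException($"Get('{key}') exceeded {deadline.TotalMilliseconds} ms");
        }
        return await getTask;
    }
}
```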

The behavior I’m seeing is intermittent, and I sometimes have to run the exact same test scenario multiple times in a row before it occurs. I am running a load test with Loadster using 50 test users, and I think that influences how often the issue occurs. Note that when I simulate an outage I kill the node instantly; I don’t shut it down gracefully.

The behavior when everything works correctly is: I kill a node, and for the next 15 seconds or so everything hangs. Then a flood of timeout errors appears. Then I start getting NodeUnavailable errors, which fail fast. Then for the next 40 seconds or so, until the failover happens, about 1/3 of my requests fail fast with NodeUnavailable while the other 2/3 work just fine. Then once the failover happens, all requests start working again, even before a rebalance and without needing one.

The behavior when things don’t work and the bug is reproduced is: I kill a node, and for the next 15 seconds or so everything hangs. Then timeout errors start to trickle in. I never see NodeUnavailable, but I do see some “Safe handle has been closed” errors, and mostly timeout errors. However, most requests hang indefinitely and never actually fail. The behavior does not change after the failover happens, or after the rebalance happens.

I have reproduced the issue again today with both behaviors using my test solution, which I added some extra logging information to. Here are the logs I have captured for each test behavior:

GOOD behavior: 22:52 Server error after 15 seconds: OperationTimeout The operation has timed ou - Pastebin.com
BAD behavior: 04:29 Server error after 15 seconds: OperationTimeout The operation has timed ou - Pastebin.com

The benefits of the good behavior are obviously that everything starts working again once the auto-failover happens, and that even before then requests fail instantly with NodeUnavailable, which allows us to bypass the cache and do a hard pull of the data, so our service has no downtime.
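
The cache-bypass path we use looks roughly like this (a simplified sketch with hypothetical names; loadFromSource stands in for our hard pull from the backing data store):

```csharp
using System;
using System.Threading.Tasks;
using Couchbase.Core;

public static class CacheFallback
{
    public static async Task<T> GetOrLoadAsync<T>(
        IBucket bucket, string key, Func<string, Task<T>> loadFromSource)
    {
        var result = await bucket.GetAsync<T>(key);
        if (result.Success)
        {
            return result.Value;
        }
        // Fast failures (e.g. NodeUnavailable) land here: skip the cache and do the
        // hard pull from the source of record, taking the latency hit until the
        // cluster recovers.
        return await loadFromSource(key);
    }
}
```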

@Trevor_Elliott -

Thanks for the detail, it really helps. Can you post the entire stack trace of the “BAD behaviour” state? For example, the exception behind “07:12 Server error after 20 seconds: OperationTimeout The operation has timed out.”?

Also, if you could enable SDK-level logging and send me the output, that would help as well.

Jeff

I ran another test, this time using the async overloads in the SDK, with the SDK logging enabled.

Here is my program log: https://pastebin.com/VBBN4a9a

I noticed the errors are a little different now that I’m using the async overloads. And even the successful calls sometimes take much longer. But the async callbacks are still taking almost 15 minutes to finish some requests. Note that in this test, the app did start working as normal after about 6-8 minutes, after failover and rebalance had been completed. This is in contrast to when I was using the synchronous methods where the app seemed permanently unresponsive.

The exception stack trace is here: https://pastebin.com/enhePP5e

@Trevor_Elliott -

That’s not exactly what I expected to see:

  • I need the logs (DEBUG level) from within the SDK itself; you can follow these directions to enable them: Logging | Couchbase Docs. What you sent is your program’s logs, which don’t show enough detail. I need to see an entire run from startup to server shutdown, to rebalance, etc.

  • “43:03 Server error after 868 seconds: TransportFailure A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond” just indicates that the server itself is terminating the connection. If you shut a server down and do not follow the steps outlined in the documentation, this will eventually make the client return NodeUnavailableExceptions, because, as I said previously, the client skips sending any keys to that server since it’s unresponsive.

  • The other errors look normal when you shut a server down and wait for it to fail over without a rebalance, as I have explained in detail in earlier posts.

Jeff

I am trying to paste the SDK logs, but I’m getting errors from these forums, and I also had all my posts flagged as spam for some reason.

It won’t let me post pastebin links now, so I made new links:


About the TransportFailure: it’s fine for it to return an error after a node goes down, and we can gracefully bypass the cache and take the temporary performance hit until the cluster is rebalanced. However, sometimes the issue persists even after we do a cluster rebalance, and as you can see in the logs, IBucket.Get or IBucket.GetAsync sometimes takes 15+ minutes before finally returning, if it returns at all. That means our entire service effectively has downtime if a single Couchbase node goes down for any reason.

@Trevor_Elliott

I created a Jira ticket earlier for you to post large files to: https://issues.couchbase.com/browse/NCBC-1880

Jeff

Sure, is Debug logging detailed enough? I don’t see all that much being logged by the SDK.

Actually, it should be much more verbose (maybe 500 MB or more) than what you pasted; with DEBUG you should be able to see quite a bit of detail and follow operations, bootstrapping, config refreshes, etc. leading up to the exceptions. I want to see everything from application startup to the point where operations hang for 15 minutes (maybe not quite that long…).

From what little you posted, note, however, that the remote party is closing the connections, so I am not sure what’s going on there. Perhaps the config is stale or the service itself on the remote node is not working correctly. With a detailed log I can rule things out and get closer to a solution.

Jeff

Does the PoolConfiguration “MaxSize” value affect how many requests per second the SDK can send to a bucket on a node? Could this configuration be a bottleneck under heavy traffic?

MaxSize affects the number of socket connections used, so it can affect total ops/sec. I wouldn’t go too high; 5-10 is usually sufficient. You can also tune SendTimeout down, but I would keep it equal to or above DefaultOperationTimeout (2500 ms). Note that this will show more noise in the log if the network/server is slow and cause more sockets to be created/destroyed.
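
For reference, here is a rough SDK 2.x configuration sketch showing where those knobs live; the server URI is a placeholder and the values simply reflect the conservative ranges above:

```csharp
using System;
using System.Collections.Generic;
using Couchbase;
using Couchbase.Configuration.Client;

public static class PoolTuningExample
{
    public static Cluster CreateCluster()
    {
        var config = new ClientConfiguration
        {
            Servers = new List<Uri> { new Uri("http://cb-node1:8091") }, // placeholder node
            PoolConfiguration = new PoolConfiguration
            {
                MinSize = 2,         // sockets opened per node up front
                MaxSize = 10,        // upper bound on sockets per node; caps total ops
                SendTimeout = 15000  // ms; keep at or above the operation timeout
            }
        };
        return new Cluster(config);
    }
}
```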

Is there a formula I could use to calculate the number of connections based on the amount of traffic our application sends to the cache? By my calculation we use 10 connections to send about 300 operations per second. How do I know whether that is reasonable or not? Thanks.

Hello,

We are experiencing the same issue with CB 4.5 and SDK 2.7. I see this issue was never resolved… Has anyone found a workaround?

@amit_e -

The issue hasn’t been resolved because we haven’t been able to replicate it on our side; note that we explicitly test this scenario as a situational test every release.

Question: are you rebalancing after the failover is triggered? If not, you will need to.

-jeff

@jmorris

For what it’s worth, I have also seen lots of problems with the .NET SDK not responding to rebalances. It’s happened to a lot of different applications/microservices over the last couple of years, primarily when doing rebalances as part of upgrading the cluster. I’ve never been able to nail down the reason why; recycling the application always fixes it. And since I can’t raise log levels without restarting, I haven’t gotten any useful logs.

Since rebalances are clearly something you test extensively, the best theory I’ve been able to come up with is that it’s somehow related to application uptime. The apps that fail have always been running for days before the rebalance. I’m thinking the SDK might eventually get into some state where the rebalance isn’t handled cleanly, and app uptime would be very difficult to include in automated tests. But that’s just a theory.

Brant


@btburnett3 -

Thanks for the feedback; I’ll discuss with our QE team and spend some time trying to replicate.

-Jeff