.NET SDK fails to recover after cluster automatic failover

Hi there.

We are experiencing issues with the .NET SDK client inside an ASP.NET web app: it is not able to recover after one of the cluster nodes disappears and automatic failover happens. The client keeps returning NodeUnavailable or OperationTimeout errors. We have to recycle the application pool to restart the application; only then can it reconnect to the cluster. We use client version 2.5.5 in production, but I also tried upgrading to 2.7.3 in staging with the same result. I can easily reproduce the problem by starting a load test against the web app and killing the Couchbase processes on one of the nodes.

Thanks.

Hi @maximv

Please can you share:

  • client configuration
  • how you connect to the cluster
  • how you re-use the cluster instance through your application
  • SDK logs when you see NodeUnavailable & OperationTimeout exceptions

We have testing in place for this scenario so I’m not sure what’s happening.

Details on how to get SDK logging, if you don’t already have it set up: Logging | Couchbase Docs

Thanks

Our configuration:

  <configSections>
    <sectionGroup name="couchbaseClients">
      <section name="couchbase" type="Couchbase.Configuration.Client.Providers.CouchbaseClientSection, Couchbase.NetClient" />
    </sectionGroup>
    <sectionGroup name="common">
      <section name="logging" type="Common.Logging.ConfigurationSectionHandler, Common.Logging" />
    </sectionGroup>    
    <!--<section name="couchbaseSession" type="Couchbase.Configuration.CouchbaseClientSection, Couchbase"/>-->
    <section name="log4net" type="log4net.Config.Log4NetConfigurationSectionHandler, log4net" />
  </configSections>
  <common>
    <logging>
      <factoryAdapter type="Common.Logging.Log4Net.Log4NetLoggerFactoryAdapter, Common.Logging.Log4Net1215">
        <arg key="configType" value="INLINE" />
      </factoryAdapter>
    </logging>
  </common>
  <couchbaseClients>
    <couchbase>
      <servers>
        #{each server in Couchbase_Servers}
        <add uri="http://#{server}:8091"/>
        #{/each}
      </servers>
      <buckets>
        <add name="VersionBucket" useSsl="false" password="#{Couchbase_DataBucket_BucketPassword}" operationLifespan="2000">
          <connectionPool maxSize="10" minSize="5" sendTimeout="12000"></connectionPool>
        </add>
        <add name="DataBucket" useSsl="false" password="#{Couchbase_VersionBucket_BucketPassword}" operationLifespan="2000">
          <connectionPool maxSize="10" minSize="5" sendTimeout="12000"></connectionPool>
        </add>
      </buckets>
      <serializer name="Binary" type="Q4Foundation.Cache.Couchbase.Serializers.Binary, Q4Foundation.Cache.Couchbase" />
    </couchbase>
  </couchbaseClients>

This is the line of code we call to initialize the client:

static CouchbaseCacheProviderFactory()
{
    ClusterHelper.Initialize("couchbaseClients/couchbase");
}

It is a static constructor, so it ensures the code is called just once.

Each time we need to access a bucket we call this line:

ClusterHelper.GetBucket("DataBucket");

Thanks - the code examples for initialization and getting a bucket look fine.
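
For reference, the usual lifecycle pattern looks something like this (a sketch assuming a classic ASP.NET Global.asax; adapt it to wherever your app starts up and shuts down):

using System;
using Couchbase;

public class Global : System.Web.HttpApplication
{
    protected void Application_Start(object sender, EventArgs e)
    {
        // Bootstrap the cluster once per app domain from the XML config section
        ClusterHelper.Initialize("couchbaseClients/couchbase");
    }

    protected void Application_End(object sender, EventArgs e)
    {
        // Release buckets and connections cleanly on shutdown
        ClusterHelper.Close();
    }
}

ClusterHelper caches bucket references internally, so calling GetBucket on every request is fine.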

The only thing that I’d query is the Binary serializer. I believe there was previously a bug when a custom serializer was used while processing a new cluster map, which happens when a failover occurs. Would you be able to test in your local system with the default serializer to see if that’s the cause?

I commented out the custom serializer line in the config file and re-ran the test. I got the same problem.
Basically what happens is that the client SDK detects the change:

25 Jan 2019 16:11:05.600Fri Jan 25 15:11:05 -06:00 2019 Couchbase.Configuration.Server.Providers.ConfigProviderBase : INFO, Config changed (forced:False) new Rev#80931 | old Rev#80925

But then I see these 2 lines:

25 Jan 2019 16:11:05.601Fri Jan 25 15:11:05 -06:00 2019 Couchbase.CouchbaseBucket : INFO, Config updated old/new: 80931, 80931,

and

25 Jan 2019 16:11:08.111Fri Jan 25 15:11:08 -06:00 2019 Couchbase.Configuration.Server.Providers.ConfigProviderBase : INFO, Ignoring config with rev#80931,

and SDK keeps sending requests to failed node:

25 Jan 2019 16:14:54.006Fri Jan 25 15:14:53 -06:00 2019 MemCached : ERROR, Couldn't store data for the key 038672619|5.15.0.6_99bc769f-c02e-4469-b877-8b887c6b1d1d_4|10c11b86-a3a4-41d5-a716-0796f2fbd6ea|MobileLinkItem Status: NodeUnavailable Message: The node 172.16.30.100:11210 that the key was mapped to is either down or unreachable. The SDK will continue to try to connect every 1000ms. Until it can connect every operation routed to it will fail with this exception. Exception: Couchbase.Core.NodeUnavailableException: The node 172.16.30.100:11210 that the key was mapped to is either down or unreachable. The SDK will continue to try to connect every 1000ms. Until it can connect every operation routed to it will fail with this exception.,

Sounds like there hasn’t been a rebalance after the failover; the client is simply bypassing that node because it’s no longer part of the cluster.

I am not sure if I understand correctly. The desired result is that the client doesn’t send requests to the node that has been failed over and is not part of the cluster anymore. The actual result is that the client keeps sending requests to that node.

The data hasn’t moved yet, so the client still tries to go to the down node. When the cluster is rebalanced, the data will move to another node and the client’s cluster map is updated - the client will then go to the data’s new location after the rebalance. Another way to handle this case is to configure replicas and temporarily do a replica read until the rebalance is triggered. Either way, you need to perform a rebalance.

One replica is set up; otherwise we would not be able to configure automatic failover.

I have 3 questions:

  1. Why does the client stop going to the failed node as soon as we recycle the application pool?
  2. How do I know if a key maps to the failed node, in order to read from the replica?
  3. I don’t really care about the replica data. We use Couchbase as a memory cache, so it is fine to discard the data on the failed node and start from scratch. Is there a way to configure the failover/client in such a way that as soon as a node is out, no more traffic is sent to that node?

Not sure; as described, you’ll need a rebalance to occur so that the client can map the data to the new node. Just re-bootstrapping (which is what happens during an app-pool recycle) isn’t enough; the situation is described here.

You will get a NodeUnavailable response - the key is mapped to a node that the client cannot communicate with. You can then do a replica-read.
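
For example, something along these lines (a sketch assuming the 2.x API; GetWithReplicaFallback is just an illustrative helper name):

using Couchbase;
using Couchbase.Core;
using Couchbase.IO;

public static class CacheReads
{
    // Illustrative helper: read from the active node first and fall back to a
    // replica when the node the key maps to is unreachable.
    public static IOperationResult<T> GetWithReplicaFallback<T>(IBucket bucket, string key)
    {
        var result = bucket.Get<T>(key);
        if (result.Status == ResponseStatus.NodeUnavailable)
        {
            // The active copy lives on a down node; the replica copy may be slightly stale.
            return bucket.GetFromReplica<T>(key);
        }
        return result;
    }
}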

That is exactly what happens when a NodeUnavailable is returned - the client detects that the node is down and immediately returns NodeUnavailable without sending an operation/traffic to that node. The exception here is that the client will try to reconnect to a down node using a heartbeat mechanism - this is done in the background.
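
The interval the client uses to re-check a down node is configurable; a sketch, assuming the 2.x ClientConfiguration:

var config = new ClientConfiguration
{
    // How often (in ms) the client probes a down node to see if it is reachable
    // again - this is the "every 1000ms" mentioned in the NodeUnavailable message.
    NodeAvailableCheckInterval = 1000
};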

Is this a Couchbase or Memcached bucket? Which server version?

-Jeff

We use Couchbase buckets. The server version is 4.1.0.

Also, after digging through the logs I noticed that when we lose a server, we get a bunch of “Safe handle has been closed” exceptions:

29 Jan 2019 23:21:33.069Tue Jan 29 22:21:33 -06:00 2019 Q4WebLogger : ERROR, Message: Failed to retrieve data for key 038672619|5.15.0.6 Cache Operation Status Status: OperationTimeout Cache Operation Message: The operation has timed out. Cache Operation Exception: System.ObjectDisposedException: Safe handle has been closed
   at System.Runtime.InteropServices.SafeHandle.DangerousAddRef(Boolean& success)
   at System.StubHelpers.StubHelpers.SafeHandleAddRef(SafeHandle pHandle, Boolean& success)
   at Microsoft.Win32.Win32Native.ResetEvent(SafeWaitHandle handle)
   at System.Threading.EventWaitHandle.Reset()
   at Couchbase.IO.MultiplexingConnection.Send(Byte[] request)
   at Couchbase.IO.Services.PooledIOService.Execute[T](IOperation`1 operation)

and then we start getting “OperationTimeout” exceptions until a manual application pool recycle:

 29 Jan 2019 23:21:39.150Tue Jan 29 22:21:39 -06:00 2019 Q4WebLogger : ERROR, Message: Failed to retrieve data for key 95f68b78-c417-468c-92b3-bc3de389c8b9-Q4Foundation.Cache.Repository.CachedSiteHostnameRepository.GetByHostnameClientUid.[hostname, goldcorp2014.q4preview.com],[clientUid, 038672619] Cache Operation Status Status: OperationTimeout Cache Operation Message: The operation has timed out. Cache Operation Exception: System.Net.Sockets.SocketException (0x80004005): No connection could be made because the target machine actively refused it
   at Couchbase.IO.DefaultConnectionFactory.<>c__0`1.<GetGeneric>b__0_0(IConnectionPool`1 p, IByteConverter c, BufferAllocator b)
   at Couchbase.IO.SharedConnectionPool`1.CreateAndAuthConnection()
   at Couchbase.IO.SharedConnectionPool`1.Acquire()
   at Couchbase.IO.Services.PooledIOService.Execute[T](IOperation`1 operation)
   at Couchbase.Core.Server.Send[T](IOperation`1 operation)

Couchbase buckets are persistent buckets, so you can access replicas if you have them enabled whether or not your application is using CB as a cache.

This might be part of your problem; pre-5.0 servers have a minimum failover time of 30s and a default of 120s. That means if you completely kill a node (take it offline), it can take as long as 120s for this to be detected by the server, and then the failover process must occur. During this time, the client’s connections will fail nearly immediately and any keys mapped to that node will receive NodeUnavailable until it’s restored or failover completes - initiating a rebalance is required here as well (per the docs). You may want to adjust this value on the server. Note that in Couchbase Server 5.0 we released a new feature called “Fast-Failover” which reduces the minimum failover time to as low as 5s.

This is a normal side effect of the state the cluster is in; multiple threads are sharing connections and the connection is closed on the server’s end. It’s an effect, not a cause - this particular error is not concerning.

You can either retry or, better yet, do a replica read here until the server maintenance has completed.
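
If you go the retry route, a short bounded retry is usually enough (again just a sketch; the attempt count and delay are illustrative):

using System.Threading;
using Couchbase;
using Couchbase.Core;
using Couchbase.IO;

public static class CacheRetries
{
    // Illustrative helper: retry a read a few times with a small delay, then give up
    // (or fall back to a replica read as sketched earlier in the thread).
    public static IOperationResult<T> GetWithRetry<T>(IBucket bucket, string key, int attempts = 3)
    {
        IOperationResult<T> result = null;
        for (var i = 0; i < attempts; i++)
        {
            result = bucket.Get<T>(key);
            if (result.Success || result.Status == ResponseStatus.KeyNotFound)
            {
                return result;
            }
            Thread.Sleep(100 * (i + 1)); // simple linear backoff between attempts
        }
        return result;
    }
}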

From the explanation above, I was expecting to see NodeUnavailable errors instead of timeouts, or at least to see timeouts until failover is complete and then NodeUnavailable errors.

The problem with timeouts is that it is too late to make a replica call, because the timeouts themselves take too long. Let’s say I make 15 requests to a 3-node Couchbase cluster while processing one web request (rendering 1 page). When one node fails, roughly a third of those requests (the 5 mapped to the failed node) will start timing out. If the timeout is 1 second, that is 5 seconds in total for the requests to the failed node. In this case the web application is pretty much unresponsive and is considered down. Our goal is to make sure that even if one node disappears, the end-user experience is unaffected.

For some reason I was hoping that the client would automatically figure out that one node is unavailable after failover is complete and stop sending requests to that node.

Any pending operation may time out if it’s in flight when the node goes down; how long do you wait for a reply? When a node goes down it takes some time for the server to detect that it’s really down and not just a temporary network glitch or TCP RST. Once the cluster determines it is down, then after the failover delay, failover will commence. While you’re in this state, it’s not a good state - connections are being created and torn down as the cluster topology changes and the client adjusts to the new topology.

Timeouts can be adjusted at the operation level (DefaultOperationLifespan) and at the connection level (SendTimeout).
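
In the XML config those are the operationLifespan attribute on the bucket element and sendTimeout on the connectionPool element (both already present in your config); in code they look roughly like this (a sketch, with illustrative values rather than recommendations):

// in your bootstrap code (Couchbase.Configuration.Client namespace)
var config = new ClientConfiguration
{
    Servers = new List<Uri> { new Uri("http://localhost:8091") },
    // Maximum time in ms an operation may live end-to-end (operationLifespan in XML)
    DefaultOperationLifespan = 1000,
    PoolConfiguration = new PoolConfiguration
    {
        // Socket send timeout in ms for each pooled connection (sendTimeout in XML)
        SendTimeout = 5000
    }
};
ClusterHelper.Initialize(config);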

That is exactly what happens - the client detects that a node is down and stops sending operations to it; instead NodeUnavailable is returned. Note that when a node is offline, you only have access to 2/3rds of the data (you removed 1 of 3 nodes and the master copy of the data on that node), unless you are doing replica reads.

This is not what I see in reality. I see operation timeouts even after node failover is completed.

@maximv -

How long after the failover completes? Are you seeing them in the logs or in the operation response?

I am following up on this issue on behalf of maximv and I am running into the same problem with timeouts in the SDK.

I have created a simple standalone ASP.NET Core app with a single GET endpoint, and I am testing this endpoint with Loadster. I have configured a Couchbase cluster with a single bucket; the endpoint takes a key from a pool of random keys and tries to Get it, or Upserts it if the key is not found.
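
In outline, the endpoint does something like this (a simplified sketch; the key pool, bucket name, and logging are illustrative, and it assumes ClusterHelper was initialized at startup with the configuration shown further down):

using System;
using System.Diagnostics;
using System.Threading;
using Couchbase;
using Couchbase.IO;
using Microsoft.AspNetCore.Mvc;

[Route("api/cache")]
public class CacheTestController : ControllerBase
{
    private static int _counter;

    [HttpGet]
    public IActionResult Get()
    {
        // Pick a key from a fixed pool so most Gets hit existing documents
        var key = "loadtest-key-" + (Interlocked.Increment(ref _counter) % 1000);
        var bucket = ClusterHelper.GetBucket("DataBucket");

        var sw = Stopwatch.StartNew();
        var result = bucket.Get<string>(key);
        sw.Stop();

        if (sw.ElapsedMilliseconds > 1000)
        {
            // The real app records these through its own logger
            Console.WriteLine($"Slow Get for {key}: {sw.ElapsedMilliseconds} ms");
        }

        if (result.Status == ResponseStatus.KeyNotFound)
        {
            // Seed the key if it does not exist yet
            bucket.Upsert(key, Guid.NewGuid().ToString());
        }

        return Ok(result.Value ?? "(created)");
    }
}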

  • My cluster is composed of 3 Couchbase servers with auto-failover set to 60 seconds.
  • At the beginning of my test, response times are almost instantaneous and errors are nonexistent under a full load test.
  • A couple of minutes into my load test, I simulate a server outage by instantly killing one of the Couchbase nodes.
  • I am timing how long each IBucket.Get(key) operation takes and logging when the operation takes longer than 1 second.
  • For the first 60 seconds after taking the node offline, I see a flood of timeouts, some taking more than 20 seconds before timing out.
  • At the 1-minute mark, failover happens.
  • I’ve waited 5-10 minutes after the failover and continued running the load test, and I still see constant timeouts of 20+ seconds.

This means that whenever a cache node goes offline, the app becomes unusable.

To test the theory that this is an in-memory problem with the SDK, I waited 10 minutes and then quickly closed and relaunched my ASP.NET Core app while the load test was running. Immediately, the timeouts ceased and the response times logged by the load test returned to near instantaneous.

Here is a snapshot of the graph of response times, with the node going offline at the 2.5-minute mark and the app restart at the 15-minute mark:

My cluster configuration is pretty simple and looks like this:

new ClientConfiguration
{
    BucketConfigs = new Dictionary<string, BucketConfiguration>
    {
        {
            "DataBucket",
            new BucketConfiguration
            {
                BucketName = "DataBucket",
                Password = "testpassword",
                UseSsl = false,
                PoolConfiguration = new PoolConfiguration
                {
                    MinSize = 5,
                    MaxSize = 10
                }
            }
        }
    },
    Servers = new List<Uri>(new[]
    {
        new Uri("http://192.168.0.12:8091"),
        new Uri("http://192.168.0.13:8091"),
        new Uri("http://192.168.0.19:8091")
    })
}

I created a Jira ticket for tracking; can you attach your demo code to the ticket? I’ll try to replicate it and see what’s going on here.

Thanks,
Jeff

Hi Jeff,

Here’s the repo:

Thanks,

Trevor