URGENT - The number of documents in my bucket dropped by 300,000

TLDR: The number of documents in my database abruptly dropped from 1,150,000 to 850,000 today, and I can’t figure out why.

After some strange Sync Gateway behaviour (documents weren’t being properly replicated to all client devices), I went to diagnose the issue on Couchbase Server. I started creating a backup before looking at anything else, then went back to the console and watched the document count drop by the amount mentioned above, after which the console crashed. I’ve been trying to diagnose the issue for the past four hours or so and am barely closer to figuring out what caused it.

A few things that seem important:

The first error in the logs was thrown by memcached:

Service 'memcached' exited with status 137. Restarting. Messages: 2017-10-09T10:42:20.087445Z WARNING 43: 
Slow STAT operation on connection: 703 ms ([ 127.0.0.1:51593 - 127.0.0.1:11209 (Admin) ])
2017-10-09T10:45:23.882821Z WARNING 42: Slow STAT operation on connection: 539 ms ([ 127.0.0.1:53991 - 
127.0.0.1:11209 (Admin) ]) 
2017-10-09T10:54:30.075347Z WARNING 45: Slow STAT operation on connection: 578 ms ([ 127.0.0.1:39257 -  
127.0.0.1:11209 (Admin) ])
2017-10-09T10:58:33.994958Z WARNING 43: Slow STAT operation on connection: 607 ms ([ 127.0.0.1:51593 - 
127.0.0.1:11209 (Admin) ])
2017-10-09T11:07:45.909795Z WARNING 43: Slow STAT operation on connection: 524 ms ([ 127.0.0.1:51593 - 
127.0.0.1:11209 (Admin) ])	

Following this, the logs show this error: Control connection to memcached on 'ns_1@127.0.0.1' disconnected: {{badmatch, {error, closed}},… (there’s a lot more)

and: Service 'memcached' exited with status 134. Restarting

Following this there were a few compaction errors:

Compactor for view `<my_bucket>/_design/sync_housekeeping` (pid [{type,
                                                                 view},
                                                                {name,
                                                                 <<"<my_bucket>/_design/sync_housekeeping">>},

Following this, Couchbase Server crashed.

I’ve considered the possibility that compaction just hadn’t run before, but it doesn’t seem plausible that that many expired documents were sitting in the bucket - there’s no way for clients to delete more than one document at a time.
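For what it’s worth, a rough way to sanity-check that would be to look at the bucket’s expiry/deletion counters with cbstats. The path and flags below are from memory and may differ by Couchbase version (the credential flags changed between 4.x and 5.x), so treat it as a sketch rather than the exact command:

    # Dump the bucket's engine stats and pull out the expiry/deletion counters
    /opt/couchbase/bin/cbstats localhost:11210 all -b <my_bucket> \
        | grep -i -E 'expired|curr_items|del_items'

If the expired/deleted counters are nowhere near 300,000, that would rule out expiry and compaction as the explanation for the drop.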

Happy to provide more detailed logs, or any other information - I just wanted to get this up ASAP, because I’m really not sure what the next steps are. It seems highly unlikely that the data is actually gone, but something has definitely gone wrong.

Thanks!

@jens @andy @priya.rajagopal @borrrden @adamf @hod.greeley

Sorry to poke you all - just wondering if you could suggest next steps, or if you have any ideas why this might be happening? It’s affecting our prod server, and our redundancy managed to fail at the same time (an error on my part), so I’m pretty worried about this.

Thanks all

Status 137 means something killed memcached (the data service) with -9 (SIGKILL); 137 is 128 + 9. Most commonly (assuming you didn’t manually kill -9 it), that would be Linux hitting an out-of-memory condition and killing processes to recover.
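If you want to confirm, the kernel log on the Couchbase node should show whether the OOM killer fired around the time memcached exited. The exact log locations vary by distro, but something along these lines usually works:

    # Look for OOM-killer activity in the kernel ring buffer
    dmesg -T | grep -i -E 'out of memory|killed process'

    # On systemd-based distros the journal keeps older kernel messages
    journalctl -k --since "2017-10-09" | grep -i oom

    # Some distros also write it to syslog
    grep -i oom /var/log/syslog /var/log/messages 2>/dev/null

If memcached shows up there as the killed process, that accounts for the exit status 137.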

Could any of the above errors have caused a large loss in data?

Doesn’t look like it. Can you describe your cluster more?

Sure!

We’re running Couchbase Server on a single node, on an AWS m3.medium instance (which I’m aware is well below the minimum requirements), and Sync Gateway on a separate m3.medium AWS instance.
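If it helps, the node’s configured RAM quota versus its actual memory can be pulled from the admin REST API; the credentials here are placeholders, not our real ones:

    # Inspect the data-service RAM quota and the node's total/free memory
    curl -s -u Administrator:password http://localhost:8091/pools/default \
        | python -m json.tool | grep -E '"memoryQuota"|"memoryTotal"|"memoryFree"'

On an m3.medium (3.75 GB RAM) there isn’t much headroom to begin with, which would fit the OOM theory above.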

First, to clarify the above, here is a clearer explanation of the sequence of events:

7:30pm - I noticed strange Sync Gateway behaviour. Changes made on one device were either only partially replicated to other devices, or not replicated at all. This seems to correspond with the memcached errors shown above.
9:30pm - After restarting Sync Gateway, I decided to see whether the changes were being reflected on the server.
9:35pm - I initiated a backup using a backup script that had been used previously (roughly sketched below).
9:40pm - While I was on the Documents tab of the Couchbase console, the document count dropped from 1,150,000 to 880,000.
9:45pm - Couchbase Server crashed.
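For reference, the backup script in the 9:35pm step is essentially a wrapper around cbbackup (its DCP producer, eq_dcpq:cbbackup-…, shows up in the memcached log further down). I haven’t copied the script verbatim, and the host, path and credentials below are placeholders, but it runs something along these lines:

    # Roughly what the backup script runs (placeholder values, not the real ones);
    # cbbackup streams the bucket's documents over DCP, which adds load to the
    # data service while it runs
    /opt/couchbase/bin/cbbackup http://localhost:8091 /backups/$(date +%F) \
        -u Administrator -p password -b <my_bucket>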

We’ve done a ton of digging through the logs; here’s what appears to have failed and what it looks like happened:

  • memcached failed at roughly 11:15 GMT:

Service 'memcached' exited with status 137. Restarting. Messages: 2017-10-09T10:42:20.087445Z WARNING 43: Slow STAT operation on connection: 703 ms ([ 127.0.0.1:51593 - 127.0.0.1:11209 (Admin) ])

  • We didn’t realise memcached had failed until I noticed that replications weren’t working properly (I’m assuming that memcached failing would lead to this?)

  • I went to diagnose the problem and initiated a backup, which started to pull documents into memory

  • Either just by coincidence, or because of the extra load on the system, memcached failed again at the same time as compaction failed:

    Service 'memcached' exited with status 134. Restarting. Messages: 2017-10-09T18:49:21.232229Z NOTICE (maisha-meds-sg) DCP (Producer) eq_dcpq:cbbackup-wFzATINGXovxZcOr - (vb 379) Scheduling backfill from 1 to 458, reschedule flag : False
    2017-10-09T18:49:21.232360Z NOTICE (maisha-meds-sg) DCP (Producer) eq_dcpq:cbbackup-wFzATINGXovxZcOr - (vb 378) Creating stream with start seqno 0 and end seqno 7
    2017-10-09T18:49:21.232411Z NOTICE (maisha-meds-sg) DCP (Producer) eq_dcpq:cbbackup-wFzATINGXovxZcOr - (vb 378) Scheduling backfill from 1 to 7, reschedule flag : False
    2017-10-09T18:49:21.232527Z NOTICE (maisha-meds-sg) DCP (Producer) eq_dcpq:cbbackup-wFzATINGXovxZcOr - (vb 829) Creating stream with start seqno 0 and end seqno 10
    2017-10-09T18:49:21.232581Z NOTICE (maisha-meds-sg) DCP (Producer) eq_dcpq:cbbackup-wFzATINGXovxZcOr - (vb 829) Scheduling backfill from 1 to 10, reschedule flag : False

    Compactor for database maisha-meds-sg (pid [{type,database},
    {important,true},
    {name,<<"maisha-meds-sg">>},
    {fa,
    {#Fun<compaction_new_daemon.4.102846360>,
    [<<"maisha-meds-sg">>,
    {config,
    {30,undefined},
    {30,undefined},
    undefined,false,false,
    {daemon_config,30,131072,
    20971520}},
    false,
    {[{type,bucket}]}]}}]) terminated unexpectedly: {{{badmatch,

Our best guess is that the compactor deletes and re-creates batches of documents, and that it was interrupted somewhere between the delete and the re-create when the server crashed, which led to the document loss.

Does that sound feasible?