In one particular environment (and only there) we are experiencing data loss.
I know our environment is not officially supported by Couchbase, but let me ask anyway:
CB Community Ed 5.1.1
OS: OpenBSD 6.3 (we ported CB ourselves)
Number of objects in DB: < 300,000
Average object size: < 1 KB
2 indexes
12 views
4 CPUs
8 GB RAM
We have not yet found a trigger for the problem. When it occurs, Couchbase first puts a high load on the machine, and then the data loss appears. Afterwards, we find the following in the logs:
######################################## memcached.log.000014.txt: unmodified: line 70309
2019-07-08T07:11:26.090852+02:00 WARNING (BUCKET) Slow runtime for 'Connection Manager' on thread nonIO_worker_0: 1528 ms
2019-07-08T07:11:29.657490+02:00 WARNING (BUCKET) Slow runtime for 'Updating stat snapshot on disk' on thread writer_worker_1: 6052 ms
2019-07-08T07:11:32.481294+02:00 NOTICE 81: (BUCKET) DCP (Producer) eq_dcpq:secidx:getseqnos-10:8a:42:16:94:8f:c7:37 - Removing connection 0xcf45c920b78
2019-07-08T07:11:41.377124+02:00 NOTICE 81: DCP connection opened successfully. flags:{} [ 127.0.0.1:34215 - 127.0.0.1:11210 (@index) ]
2019-07-08T07:11:44.193807+02:00 WARNING (BUCKET) Slow runtime for 'Backfilling items for a DCP Connection' on thread auxIO_worker_0: 559 ms
######################################## indexer.log: unmodified: line 1552
2019-07-10T13:16:13.467+02:00 [ERRO][FDB] docio_read_doc_key error: read failure on offset 22220800 in a database file '/opt/couchbase/var/lib/couchbase/data/@2i/BUCKET_#primary_295915238280948878_0.index/data.fdb.432' : FDB status -5, lastbid 0x1531, curblock 0xffffffffffffffff, curpos 0x0
(hex) 0x240277000, 4096 (0x1000) bytes
0000 08 08 00 01 00 01 00 5b ff ff ff ff ff ff ff ff  .......[........
0010 00 00 00 00 00 00 01 90 00 00 00 00 01 90 41 a3  ..............A.
0020 00 00 00 00 00 00 01 91 00 00 00 00 01 70 01 1e  .............p..
0030 00 00 00 00 00 00 01 92 00 00 00 00 01 72 b1 12  .............r..
######################################## babysitter.log.1: unmodified: line 3123
** Reason for termination ==
** {abnormal,131}
[error_logger:error,2019-07-08T07:11:12.521+02:00,babysitter_of_ns_1@127.0.0.1:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
=========================CRASH REPORT=========================
crasher:
initial call: ns_port_server:init/1
pid: <0.648.420>
registered_name:
exception exit: {abnormal,131}
in function gen_server:terminate/6 (gen_server.erl, line 744)
ancestors: [<0.645.420>,<0.647.420>,ns_child_ports_sup,
ns_babysitter_sup,<0.55.0>]
messages: [{'EXIT',<0.654.420>,normal}]
links: [<0.645.420>]
dictionary:
########################################
After this, we’ve lost a substantial amount of data.
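For reference, this is roughly how the missing documents can be enumerated after an incident. It is only a minimal sketch, assuming the Couchbase Python SDK 2.x; the host, credentials, bucket name and the expected_keys.txt key list are placeholders, not our real setup:

# Hypothetical sketch: compare the bucket against a list of keys that should exist.
from couchbase.cluster import Cluster, PasswordAuthenticator

cluster = Cluster('couchbase://127.0.0.1')
cluster.authenticate(PasswordAuthenticator('Administrator', 'password'))
bucket = cluster.open_bucket('BUCKET')

missing = []
with open('expected_keys.txt') as f:   # one document key per line
    for line in f:
        key = line.strip()
        if not key:
            continue
        # quiet=True suppresses NotFoundError; rc.success is False for lost documents
        rc = bucket.get(key, quiet=True)
        if not rc.success:
            missing.append(key)

print('%d expected documents are missing' % len(missing))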
Do you have any hints on what we could do to investigate and work around this?