We experienced a partial failure of our Couchbase cluster (Community Edition 6.0.0 build 1693, running in AWS): one of our 28 nodes failed due to underlying hardware issues, then three more failed in succession during the rebalance / failover operations.
While this was happening, various processes were trying to perform batch creations/updates of documents using the Python SDK version 3.0.8. Needless to say, lots of errors were encountered and operations were retried.
Eventually, the cluster stabilised, but in the aftermath, we’re noticing various degrees of data corruption of some documents:
1) lost content: get() returning None instead of the JSON document
2) JSON data still there, document valid, but corrupted in a weird way:
   - keys from it have had their suffixes trimmed. For example, the document {"my_key": 1} is now {"my_ke": 1}
   - new data inserted, which doesn’t belong to us, but apparently is from the Python SDK. For example, the document {"my_key": 1} became {"my_key": "error_context": {"status_code": 0, "opaque": 42, "cas": 1620751392080658432, "key": "document-key-here", "bucket": "bucket", "collection": "", "scope": "", "context": "", "ref": "", "endpoint": "10.x.x.x:11210","type": "KVErrorContext"}
While I can understand the returned None values, I can’t imagine how it’s possible for document keys to have their suffixes trimmed, nor how it’s possible to have data from an SDK error inserted at a random position in the document (although this last one could be explained by some obscure bug in our code, since we’re doing a “read, merge data, replace using CAS” cycle to update the documents).
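To make the update pattern concrete, here is a minimal sketch of a read-merge-replace loop guarded by CAS. It is not our production code: `InMemoryStore`, `CasMismatch` and `merge_with_cas` are hypothetical stand-ins so the example runs without a cluster, and the real SDK calls and exception names differ. The type check before merging is the kind of guard that should stop a failed read or an error object from ending up inside the stored document.

```python
import copy


class CasMismatch(Exception):
    """Hypothetical stand-in for the SDK's CAS mismatch error."""


class InMemoryStore:
    """Hypothetical stand-in for a bucket: maps key -> (document, cas)."""

    def __init__(self):
        self._data = {}
        self._next_cas = 1

    def insert(self, key, doc):
        self._data[key] = (copy.deepcopy(doc), self._next_cas)
        self._next_cas += 1

    def get(self, key):
        doc, cas = self._data[key]
        return copy.deepcopy(doc), cas

    def replace(self, key, doc, cas):
        _, current_cas = self._data[key]
        if cas != current_cas:
            # Someone else changed the document since we read it.
            raise CasMismatch(key)
        self._data[key] = (copy.deepcopy(doc), self._next_cas)
        self._next_cas += 1


def merge_with_cas(store, key, patch, max_attempts=5):
    """Read the document, merge `patch` into it, write it back using CAS."""
    for _ in range(max_attempts):
        doc, cas = store.get(key)
        # Never merge into something that is not a JSON object
        # (for example None or an error payload from a failed read).
        if not isinstance(doc, dict):
            raise TypeError(f"unexpected content for {key!r}: {type(doc).__name__}")
        merged = {**doc, **patch}
        try:
            store.replace(key, merged, cas)
            return merged
        except CasMismatch:
            continue  # re-read the latest version and try again
    raise RuntimeError(f"gave up on {key!r} after {max_attempts} attempts")
```

With a store holding `{"my_key": 1}` under some key, calling `merge_with_cas(store, key, {"other": 2})` stores and returns the merged document, while a write with a stale CAS value is rejected instead of overwriting newer data.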
At this moment I can’t confirm whether the corruption only occurred in documents we attempted to update during the failure, or whether other documents are also affected (we have around 1.3 billion documents, so it’s kind of hard to estimate the impact).
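One way to estimate the impact would be to sample documents and check each one for the symptoms described above. The sketch below is only an illustration: `find_anomalies` and the `expected_keys` argument are hypothetical, it assumes the set of legitimate top-level keys is known in advance, and it only inspects the top level of a document.

```python
def find_anomalies(doc, expected_keys):
    """Return a list of human-readable issues found in one document."""
    if doc is None:
        return ["missing content"]
    if not isinstance(doc, dict):
        return ["content is not a JSON object"]

    issues = []
    for key, value in doc.items():
        looks_like_error = isinstance(value, dict) and value.get("type") == "KVErrorContext"
        if key == "error_context" or looks_like_error:
            # Matches the SDK error payload seen in the corrupted documents.
            issues.append(f"SDK error context found at {key!r}")
        elif key not in expected_keys and any(e.startswith(key) for e in expected_keys):
            # An unknown key that is a prefix of a known key, e.g. "my_ke".
            issues.append(f"key {key!r} looks like a truncated expected key")
    return issues
```

A clean document yields an empty list, so the share of sampled documents with a non-empty result gives a rough lower bound on the corruption rate.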
Is there any known issue in the mentioned server version and/or Python SDK that could explain the data corruption described in 2)?