We ran some tests and also hit a critical failure of our Couchbase cluster in the production environment.
The issue summary
Deliberately shutting down one node causes:
- Get ops on the cluster drop to 0
- Some of the records being written are lost
- Auto fail-over takes several minutes, after which the cluster returns to OK. (We set the auto fail-over timeout to 30s.)
The test environment
- Couchbase 2.2
- 4 nodes
- Replicas: 2
- Every node: 4 CPUs, 16 GB memory
- C SDK version: 2.0.6
Testcase 1
Steps:
0. 10m records are stored in the cluster; auto fail-over is enabled
- 3 clients read from the cluster
- One node is shut down unexpectedly (crash)
Result:
- All gets time out; no data can be read.
- About 4 minutes later, the cluster returns to OK.
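To put a number on the "about 4 minutes" outage, the clients can log a timestamp and a success flag for every get, and the failure window can be computed afterwards. The following is only a minimal sketch: the function name `outage_windows` and the sample log are hypothetical, and it assumes each client records its samples in chronological order.

```python
from typing import Iterable, List, Tuple


def outage_windows(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (start, end) pairs for each run of consecutive failed gets.

    samples: (timestamp_seconds, succeeded) pairs in chronological order.
    A window starts at the first failure and ends at the next success;
    a run still failing at the end of the log is closed at its last failure.
    """
    windows = []
    start = None
    last_ts = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts  # first failed get of a new outage
        elif ok and start is not None:
            windows.append((start, ts))  # first successful get ends the outage
            start = None
        last_ts = ts
    if start is not None:
        windows.append((start, last_ts))
    return windows


# Hypothetical client log (not measured data): gets fail from t=10s
# and succeed again at t=250s.
log = [(0.0, True), (10.0, False), (130.0, False), (250.0, True), (260.0, True)]
for begin, end in outage_windows(log):
    print(f"gets failed for {end - begin:.0f}s, from t={begin:.0f}s to t={end:.0f}s")
```

Comparing the measured window with the configured 30s auto fail-over timeout shows how much of the outage is spent outside the fail-over wait itself.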
Testcase 2
Steps:
0. 10m records are stored in the cluster; auto fail-over is enabled
- Start clients writing 500k records
- Shut down one node while the writes are in progress
Result:
- As soon as the node shuts down, write ops drop to 0
- After 1 minute the cluster returns to OK
- On checking the data:
  a) 13 writes timed out
  b) 5087 of the 500k records cannot be found
  c) Some of the original 10m records are lost
  d) The lost records cannot be found even after a manual rebalance.
Attached screenshots:
- Ops drop to 0 when one node crashes
- Ops resume after 1 minute
- One node is in the fail-over state
- The record count after the rebalance
The count should be 10m + 500k = 10.5m. As the Testcase 2 results show, Couchbase lost both newly written data and previously existing data.
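The check behind these numbers can be sketched as follows: compute the expected total, then read every key that was written and list the ones that come back empty. This is a minimal sketch, not the script we actually ran. The `find_missing` helper, the `key-N` naming and the in-memory `store` are hypothetical stand-ins; in a real check `fetch` would wrap a get call from the client SDK.

```python
from typing import Callable, Iterable, List, Optional


def find_missing(keys: Iterable[str], fetch: Callable[[str], Optional[bytes]]) -> List[str]:
    """Return every key for which fetch() reports no value (None)."""
    return [k for k in keys if fetch(k) is None]


# Figures taken from the test description.
PRELOADED = 10_000_000
NEWLY_WRITTEN = 500_000
NOT_FOUND = 5087

expected_total = PRELOADED + NEWLY_WRITTEN  # 10,500,000 records
loss_rate = NOT_FOUND / NEWLY_WRITTEN       # share of the new writes that is missing
print(f"expected {expected_total} records, {loss_rate:.2%} of new writes missing")

# Tiny in-memory stand-in for the cluster (hypothetical): keys 3 and 7 were "lost".
store = {f"key-{i}": b"v" for i in range(10) if i not in (3, 7)}
missing = find_missing((f"key-{i}" for i in range(10)), store.get)
print(missing)
```

Running the same scan over the original 10m keys as well as the 500k new ones separates the loss of new data from the loss of old data.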