Previously I ran Couchbase Server 4.5 with Sync Gateway 1.4.1, and it worked well.
Recently I upgraded everything to Couchbase Server 5.0 and Sync Gateway 1.5.1, and I ran into some trouble.
The two Couchbase Server nodes show 5%~15% CPU, 50% RAM, and 1K~3K ops per second.
But the Sync Gateway servers reach 100% CPU within a minute, and I don't know what to do.
The Sync Gateways seem to be processing requests fine:
2018-04-06T09:16:46.499Z HTTP: #4410: POST /mybucket/_changes (as 32683737-cf7b-440a-bfcf-b4f656b9c656)
2018-04-06T09:16:46.499Z HTTP: #4414: POST /mybucket/_bulk_docs (as 25caa0b2-cfff-4e98-89be-008245df907a)
2018-04-06T09:16:46.499Z HTTP: #4411: POST /mybucket/_changes?feed=longpoll&heartbeat=45083&style=all_docs&since=683687687 (as 0a97a0e4-b733-4c86-82e4-3cf11710391b)
2018-04-06T09:16:46.499Z HTTP+: #4407: → 201 (86.3 ms)
2018-04-06T09:16:46.501Z HTTP+: #4102: → 201 (5282.6 ms)
2018-04-06T09:16:46.502Z HTTP: #4412: POST /mybucket/_bulk_docs (as 2d1bc4fe-60c9-49ed-958e-99e7110c9ac4)
2018-04-06T09:16:46.525Z HTTP+: #4405: → 201 (130.9 ms)
2018-04-06T09:16:46.527Z HTTP+: #4416: → 200 (32.9 ms)
2018-04-06T09:16:46.527Z HTTP: #4413: POST /mybucket/_revs_diff (as 31486d86-1a12-4f4d-9fe5-60742f7b9818)
2018-04-06T09:16:46.552Z HTTP: #4415: POST /mybucket/_bulk_docs (as 8a210afb-4db6-48db-a6a4-2c59afc0f67f)
2018-04-06T09:16:46.553Z HTTP+: #3445: → 200 (17685.0 ms)
2018-04-06T09:16:46.565Z HTTP+: #4398: → 200 OK (0.0 ms)
2018-04-06T09:16:46.572Z HTTP: #4417: PUT /mybucket/_local/df5047454d2f87206bcf8802cda3b265f01559d7 (as 83c16f09-f446-4c3c-b786-74ffc654b151)
2018-04-06T09:16:46.588Z HTTP: #4418: POST /mybucket/_bulk_docs (as 2d1bc4fe-60c9-49ed-958e-99e7110c9ac4)
2018-04-06T09:16:46.607Z HTTP+: #4417: → 201 (72.2 ms)
2018-04-06T09:16:46.607Z HTTP: #4419: PUT /mybucket/_local/0e08a0fcc6fd01e90c0405d94d06e82b5a1e1f39 (as 3a32c428-3b68-4869-bb8a-57ac3e6e8715)
2018-04-06T09:16:46.616Z HTTP+: #4403: → 200 OK (0.0 ms)
2018-04-06T09:16:46.646Z HTTP+: #4367: → 200 OK (0.0 ms)
2018-04-06T09:16:46.758Z HTTP: #4432: POST /mybucket/_changes (as 4510cc5f-d2c0-4f6c-9d7c-f5224ef9b152)
2018-04-06T09:16:46.758Z HTTP: #4435: PUT /mybucket/_local/39dcee8d74a20cf8daf9cd94c8f40dc70256bf66 (as 41d328d3-9168-4ccf-a43a-394c0d36804d)
2018-04-06T09:16:46.758Z HTTP+: #4432: → 200 OK (0.0 ms)
2018-04-06T09:16:46.768Z HTTP+: #311: → 200 OK (0.0 ms)
But as time goes on, some strange behavior shows up.
So right now I am killing the Sync Gateway processes every two minutes.
Some suspicious log entries are:
2018-04-06T06:25:24.182Z WARNING: Error returned when releasing sequence 683399154. Falling back to skipped sequence handling. Error:operation has timed out – db.(*Database).updateAndReturnDoc() at crud.go:1044
2018-04-06T06:25:24.183Z WARNING: backupAncestorRevs failed: doc=“mydocmydoc” rev=“1-7a2ba9a9945b13ce12b03db2fa5d309d” err=operation has timed out – db.(*Database).backupAncestorRevs() at crud.go:520
2018-04-06T06:25:24.314Z WARNING: backupAncestorRevs failed: doc=“local:mydocmydoc” rev=“10-076f71922c819cc9d3525de9420cc7a6” err=operation has timed out – db.(*Database).backupAncestorRevs() at crud.go:520
2018-04-06T09:20:29.081Z changes_view: Query took 216.544767ms to return 33 rows, options = db.Body{“startkey”:interface {}{“7ce0f464-c981-4199-9632-0205b61f1cfc”, 0x1}, “endkey”:interface {}{“7ce0f464-c981-4199-9632-0205b61f1cfc”, 0x28c0622f}, “stale”:false}
Also, I lost some documents (about 4%?) by mistake while upgrading Couchbase Server from 4.5 to 5.0.
Another change is that the reverse proxy moved from Nginx to an AWS Load Balancer (for scaling).
When I upgraded the cluster, I added one Couchbase Server 5.0 node to the live cluster of two 4.5 nodes.
Something was not going well, so I removed the 5.0 node.
But the cluster was still rebalancing, and the two 4.5 nodes ended up with different numbers of documents.
(I guess that, because of the replication settings, new documents were not distributed evenly.)
So I removed one 4.5 node with a failover (I didn't understand the Couchbase Server concepts at the time).
Finally, I copied the remaining documents to a new cluster of a single 5.0 node with cbtransfer.
But cbtransfer stopped at 96%, so I guess I lost some documents.
Still, I don't think the document loss is the cause of the 100% CPU.
The average rate is 6,000~7,000 HTTP requests per minute (peaking above 15,000).
Does SG itself need to do some initial processing? As far as I remember, I waited more than a few hours.
After that, I killed the SG processes every two minutes until I downgraded.
I don't have good numbers on timing. With CBS 5.0 and SG 1.5 there will be processing to move all of the sync information into XATTRs. Hours seems like plenty for initializing, but I think 100M docs will take quite a while, so it may just not have run long enough.
It would be a lot faster if you don't need all the docs. You can set up import filters for SG; you might look at using those.
Once the initial setup has gone through, restarting should be fairly quick.
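To illustrate the kind of filter being suggested here, a minimal sketch of an import filter function (the `doc.type == "mobile"` convention is just an assumption for the example; use whatever field actually marks the documents your mobile clients need):

```javascript
function (doc) {
  // Assumed convention: only documents tagged for mobile use need Sync Gateway processing.
  if (doc.type == "mobile") {
    return true;   // import this doc and run it through the sync function
  }
  return false;    // skip server-only docs so SG doesn't spend time on them
}
```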
No. You will have to specify that config flag to enable XAttrs. There is also an import_docs flag that you need to set up.
And the import_filter function is what you set up for the filtering. As Hod pointed out, you may want to revisit which documents you want to process via the SGW. If you don't expect to sync all the documents to mobile clients, then filter those out of processing.
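A rough sketch of where those settings go in the database section of the Sync Gateway config, using the bucket name from your logs (the flag names are, as far as I recall, what SG 1.5 uses for shared bucket access and import; the server address and filter body are just placeholders, with the same illustrative `doc.type` check as above):

```json
{
  "databases": {
    "mybucket": {
      "server": "http://<couchbase-server-host>:8091",
      "bucket": "mybucket",
      "enable_shared_bucket_access": true,
      "import_docs": "continuous",
      "import_filter": "function(doc) { return doc.type == \"mobile\"; }"
    }
  }
}
```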
Didn't quite follow your question, but XAttrs are added by SGW as it processes the documents that it imports. Check out the blog on using shared bucket access.
You may also want to check out this upgrade guide on upgrading from a pre-XAttr version of SGW.