We are running SG 1.5.1 (Couchbase Sync Gateway/1.5.1(4;cb9522c)) on top of CB 5.0, both Community Edition. We recently upgraded from SG 1.4 and CB 4 - we upgraded the SGs first, then the CB cluster. We're having a problem where our Sync Gateways lock up and database access times out. We can "unlock" things by restarting the SG.
After a restart of the SGs, this is how the error seems to unfold:
First, we get lots of JSON parse errors as users get their feeds:
2018-01-15T14:33:03.536Z HTTP: #928: POST /[database]/_revs_diff (as user.2)
2018-01-15T14:33:03.597Z WARNING: Unexpected error parsing body of rev "19-63fb3798d240d46358c2842b41030532" -- db.(*document).getNonWinningRevisionBody() at document.go:309
2018-01-15T14:33:03.603Z WARNING: Unexpected error parsing body of rev "4-13c5a67bf539c4f1b5d36b40cfdd308c" -- db.(*document).getNonWinningRevisionBody() at document.go:309
2018-01-15T14:33:03.610Z WARNING: Unexpected error parsing body of rev "36-2fc399f9b4091cbe8afb61b3cbff8bae" -- db.(*document).getNonWinningRevisionBody() at document.go:309
Then the SGs lock up. I'm not sure if it's a fluke, but this has happened twice now when the number of changes feeds reached 100… at that point this appears in the logs:
2018-01-15T14:54:23.696Z WARNING: RetryLoop for Get acc.settings.117.jQbFKJNOjnsIsS19Ok-7CA|snag_codes giving up after 11 attempts -- base.RetryLoop() at util.go:298
2018-01-15T14:54:23.757Z WARNING: RetryLoop for Get acc.users.117|2208 giving up after 11 attempts -- base.RetryLoop() at util.go:298
2018-01-15T14:54:23.757Z WARNING: Changes feed: error getting doc sync data "acc.users.117|2208": operation has timed out -- db.(*Database).addDocToChangeEntry() at changes.go:106
2018-01-15T14:54:23.929Z WARNING: RetryLoop for Get _sync:user:user.2553 giving up after 11 attempts -- base.RetryLoop() at util.go:298
2018-01-15T14:54:23.929Z HTTP auth failed for username="user.2553"
2018-01-15T14:54:23.929Z HTTP: #3317: PUT /[database]/_local/b993d11acdec2288a2024897aac6a684ed3af316
2018-01-15T14:54:23.929Z HTTP: #3317: --> 401 Invalid login (32619.6 ms)
...
2018-01-15T14:58:11.791Z HTTP: #3496: POST /[database]/_bulk_docs (as user.2554)
2018-01-15T14:58:12.046Z WARNING: RetryLoop for Get acc.settings.117.jQbFKJNOjnsIsS19Ok-7CA|snag_codes giving up after 11 attempts -- base.RetryLoop() at util.go:298
2018-01-15T14:58:15.259Z WARNING: RetryLoop for Get acc.settings.117.jQbFKJNOjnsIsS19Ok-7CA|snag_codes giving up after 11 attempts -- base.RetryLoop() at util.go:298
2018-01-15T14:58:16.455Z WARNING: RetryLoop for Get _sync:seq giving up after 11 attempts -- base.RetryLoop() at util.go:298
2018-01-15T14:58:16.460Z HTTP: #3497: GET /_expvar (ADMIN)
2018-01-15T14:58:16.460Z HTTP: Recording snapshot of current debug variables.
2018-01-15T14:58:16.488Z HTTP: #3498: GET /[database]/ (ADMIN)
2018-01-15T14:58:18.652Z WARNING: RetryLoop for Incr with key: _sync:seq giving up after 11 attempts -- base.RetryLoop() at util.go:298
2018-01-15T14:58:18.652Z WARNING: Error from Incr in sequence allocator (1) - attempt (1/3): operation has timed out -- db.(*sequenceAllocator).incrWithRetry() at sequence_allocator.go:101
2018-01-15T14:58:18.662Z WARNING: Too many unsuccessful Incr attempts in sequence allocator - giving up (1): operation has timed out -- db.(*sequenceAllocator).incrWithRetry() at sequence_allocator.go:107
2018-01-15T14:58:18.662Z WARNING: Error from Incr in _reserveSequences(1): Unable to increment sequence: operation has timed out -- db.(*sequenceAllocator)._reserveSequences() at sequence_allocator.go:68
2018-01-15T14:58:19.112Z HTTP: #3499: POST /[database]/_bulk_docs (as user.2554)
2018-01-15T14:58:21.550Z WARNING: RetryLoop for Get _sync:seq giving up after 11 attempts -- base.RetryLoop() at util.go:298
After that, the monitoring calls we make to the database URL and the _expvar endpoint start to time out, taking about 30s each. Our service goes down and we need to restart the SG to bring it back.
Earlier today we waited around 40 minutes and the Sync Gateways came back online; however, our clients are up and using the service now, so we can't wait that long.
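For context, the monitoring check is roughly this (simplified; the hosts and database name are placeholders for our real values):

# Simplified sketch of our monitoring probe; hosts and database name are
# placeholders. When the SGs lock up, both requests climb to ~30s or time out.
import time
import requests

ENDPOINTS = [
    'http://localhost:4984/[database]/',   # database root on the public port
    'http://[IP address]:4985/_expvar',    # debug variables on the admin port
]

for url in ENDPOINTS:
    start = time.time()
    try:
        resp = requests.get(url, timeout=60)
        status = resp.status_code
    except requests.RequestException as exc:
        status = type(exc).__name__
    print('%s -> %s in %.1fs' % (url, status, time.time() - start))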
This is our config:
{
  "interface": "localhost:4984",
  "adminInterface": "[IP address]:4985",
  "CORS": {
    "Origin": ["https://[url]", "https://[url]"],
    "Headers": ["Authorization", "Content-Type"]
  },
  "maxFileDescriptors": 250000,
  "databases": {
    "[database]": {
      "server": "http://cb01:8091",
      "bucket": "[bucket]",
      "username": "[user]",
      "password": "",
      "unsupported": {
        "user_views": {"enabled": true},
        "enable_extended_attributes": {"enabled": false}
      }
    }
  },
  "logging": {
    "default": {
      "logFilePath": "/var/log/sync_gateway/sync_gateway_access.log",
      "logKeys": ["HTTP"],
      "logLevel": "debug",
      "rotation": {
        "maxsize": 1,
        "maxage": 30,
        "maxbackups": 7,
        "localtime": true
      }
    }
  }
}
We realised when this error started happening that we'd been running the SGs with enable_extended_attributes.enabled: true, and we have now reverted that to false - could that have created this issue?
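To try to narrow that down, we've been spot-checking documents that appear in the parse warnings directly from the bucket, along these lines (Python SDK 2.x against CB 5.0; the credentials are placeholders and the document key is one taken from our logs):

# Rough spot-check sketch (Python SDK 2.x, CB 5.0 RBAC); hostname, bucket name
# and credentials are placeholders for our real values.
from couchbase.cluster import Cluster, PasswordAuthenticator

cluster = Cluster('couchbase://cb01')
cluster.authenticate(PasswordAuthenticator('[user]', '[password]'))
bucket = cluster.open_bucket('[bucket]')

# One of the documents from the "Unexpected error parsing body of rev" warnings.
doc = bucket.get('acc.users.117|2208')
body = doc.value

# With xattrs disabled, SG keeps its metadata inline in the body under "_sync";
# if that key is missing, the doc was presumably written while xattrs were enabled.
print('_sync present inline:', isinstance(body, dict) and '_sync' in body)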
Has anyone got any suggestions / fixes please? Is this related to xattrs? Any advice would be great.