Received error 503 when contacting the query service

Hi,

I’m running into an issue where a sudden spike in N1QL queries causes the query service on a Couchbase node to stop accepting any requests until I manually restart either the node or the service (pkill -9 -f /opt/couchbase/bin/cbq-engine).
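For reference, this is roughly how I recover from it today (a rough sketch assuming a default Linux package install; the /admin/ping health check and the systemd service name are my assumptions, not something verified on this cluster):

# check whether cbq-engine still answers on the query port
curl -s http://localhost:8093/admin/ping

# if it does not respond, kill the stuck process and let the node's supervisor respawn it
sudo pkill -9 -f /opt/couchbase/bin/cbq-engine

# or, as the heavier option, restart the whole node
sudo systemctl restart couchbase-server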

This is what the Query page looks like in the web console:

Note that the error is not transient; it does not resolve on its own.

Do you have any idea what the issue could be, and why the query service does not restart automatically?

The Couchbase version is Couchbase Server Community Edition 7.0.2, build 6703.

Here are the error logs:

[ns_server:error,2022-03-16T10:38:04.859Z,ns_1@172.17.0.2:<0.428.0>:prometheus:post_async:188]Prometheus http request failed:
URL: http://127.0.0.1:9123/api/v1/query
Body: query=%7Bname%3D~%60kv_curr_items%7Ckv_curr_items_tot%7Ckv_mem_used_bytes%7Ccouch_docs_actual_disk_size%7Ccouch_views_actual_disk_size%7Ckv_ep_db_data_size_bytes%7Ckv_ep_bg_fetched%60%7D+or+kv_vb_curr_items%7Bstate%3D%27replica%27%7D+or+kv_vb_num_non_resident%7Bstate%3D%27active%27%7D+or+label_replace%28sum+by+%28bucket%2C+name%29+%28irate%28kv_ops%7Bop%3D%60get%60%7D%5B1m%5D%29%29%2C+%60name%60%2C%60cmd_get%60%2C+%60%60%2C+%60%60%29+or+label_replace%28irate%28kv_ops%7Bop%3D%60get%60%2Cresult%3D%60hit%60%7D%5B1m%5D%29%2C%60name%60%2C%60get_hits%60%2C%60%60%2C%60%60%29+or+label_replace%28sum+by+%28bucket%29+%28irate%28kv_cmd_lookup%5B1m%5D%29+or+irate%28kv_ops%7Bop%3D~%60set%7Cincr%7Cdecr%7Cdelete%7Cdel_meta%7Cget_meta%7Cset_meta%7Cset_ret_meta%7Cdel_ret_meta%60%7D%5B1m%5D%29%29%2C+%60name%60%2C+%60ops%60%2C+%60%60%2C+%60%60%29+or+sum+by+%28bucket%2C+name%29+%28%7Bname%3D~%60index_data_size%7Cindex_disk_size%7Ccouch_spatial_data_size%7Ccouch_spatial_disk_size%7Ccouch_views_data_size%60%7D%29&timeout=5s
Reason: {failed_connect,[{to_address,{"127.0.0.1",9123}},
                         {inet,[inet],econnrefused}]}
[ns_server:error,2022-03-16T10:38:04.860Z,ns_1@172.17.0.2:<0.433.0>:prometheus:post_async:188]Prometheus http request failed:
URL: http://127.0.0.1:9123/api/v1/query
Body: query=%7Bcategory%3D%60system-processes%60%2Cname%3D~%60sysproc_mem_resident%7Csysproc_mem_size%7Csysproc_cpu_utilization%7Csysproc_major_faults_raw%60%7D&timeout=5s
Reason: {failed_connect,[{to_address,{"127.0.0.1",9123}},
                         {inet,[inet],econnrefused}]}
[ns_server:error,2022-03-16T10:38:04.860Z,ns_1@172.17.0.2:<0.436.0>:prometheus:post_async:188]Prometheus http request failed:
URL: http://127.0.0.1:9123/api/v1/query
Body: query=%7Bcategory%3D%60system%60%2Cname%3D~%60sys_cpu_utilization_rate%7Csys_cpu_stolen_rate%7Csys_swap_total%7Csys_swap_used%7Csys_mem_total%7Csys_mem_free%7Csys_mem_limit%7Csys_cpu_cores_available%7Csys_allocstall%60%7D&timeout=5s
Reason: {failed_connect,[{to_address,{"127.0.0.1",9123}},
                         {inet,[inet],econnrefused}]}
[ns_server:error,2022-03-16T10:38:04.895Z,ns_1@172.17.0.2:ns_heart<0.382.0>:ns_heart:grab_one_service_status:409]Failed to grab service index status: {exit,
                                      {noproc,
                                       {gen_server,call,
                                        ['service_agent-index',get_status,
                                         2000]}},
                                      [{gen_server,call,3,
                                        [{file,"gen_server.erl"},{line,223}]},
                                       {ns_heart,grab_one_service_status,1,
                                        [{file,"src/ns_heart.erl"},
                                         {line,406}]},
                                       {ns_heart,
                                        '-grab_service_statuses/0-lc$^1/1-1-',
                                        1,
                                        [{file,"src/ns_heart.erl"},
                                         {line,402}]},
                                       {ns_heart,current_status_slow_inner,0,
                                        [{file,"src/ns_heart.erl"},
                                         {line,276}]},
                                       {ns_heart,current_status_slow,1,
                                        [{file,"src/ns_heart.erl"},
                                         {line,235}]},
                                       {ns_heart,update_current_status,1,
                                        [{file,"src/ns_heart.erl"},
                                         {line,172}]},
                                       {ns_heart,handle_info,2,
                                        [{file,"src/ns_heart.erl"},
                                         {line,116}]},
                                       {gen_server,try_dispatch,4,
                                        [{file,"gen_server.erl"},
                                         {line,637}]}]}
[ns_server:error,2022-03-16T10:38:04.930Z,ns_1@172.17.0.2:<0.450.0>:prometheus:post_async:188]Prometheus http request failed:
URL: http://127.0.0.1:9123/api/v1/query
Body: query=cm_failover_safeness_level%7Bbucket%3D%60******************%60%7D%5B20s%5D&timeout=5s
Reason: <<"Service Unavailable">>
[ns_server:error,2022-03-16T10:38:04.939Z,ns_1@172.17.0.2:<0.473.0>:prometheus:post_async:188]Prometheus http request failed:
URL: http://127.0.0.1:9123/api/v1/query
Body: query=cm_failover_safeness_level%7Bbucket%3D%60contentful-testing%60%7D%5B20s%5D&timeout=5s
Reason: <<"Service Unavailable">>
[ns_server:error,2022-03-16T10:38:05.051Z,ns_1@172.17.0.2:<0.510.0>:prometheus:post_async:188]Prometheus http request failed:
URL: http://127.0.0.1:9123/api/v1/query
Body: query=%7Bname%3D~%60kv_curr_items%7Ckv_curr_items_tot%7Ckv_mem_used_bytes%7Ccouch_docs_actual_disk_size%7Ccouch_views_actual_disk_size%7Ckv_ep_db_data_size_bytes%7Ckv_ep_bg_fetched%60%7D+or+kv_vb_curr_items%7Bstate%3D%27replica%27%7D+or+kv_vb_num_non_resident%7Bstate%3D%27active%27%7D+or+label_replace%28sum+by+%28bucket%2C+name%29+%28irate%28kv_ops%7Bop%3D%60get%60%7D%5B1m%5D%29%29%2C+%60name%60%2C%60cmd_get%60%2C+%60%60%2C+%60%60%29+or+label_replace%28irate%28kv_ops%7Bop%3D%60get%60%2Cresult%3D%60hit%60%7D%5B1m%5D%29%2C%60name%60%2C%60get_hits%60%2C%60%60%2C%60%60%29+or+label_replace%28sum+by+%28bucket%29+%28irate%28kv_cmd_lookup%5B1m%5D%29+or+irate%28kv_ops%7Bop%3D~%60set%7Cincr%7Cdecr%7Cdelete%7Cdel_meta%7Cget_meta%7Cset_meta%7Cset_ret_meta%7Cdel_ret_meta%60%7D%5B1m%5D%29%29%2C+%60name%60%2C+%60ops%60%2C+%60%60%2C+%60%60%29+or+sum+by+%28bucket%2C+name%29+%28%7Bname%3D~%60index_data_size%7Cindex_disk_size%7Ccouch_spatial_data_size%7Ccouch_spatial_disk_size%7Ccouch_views_data_size%60%7D%29&timeout=5s
Reason: <<"Service Unavailable">>
[ns_server:error,2022-03-16T10:38:05.073Z,ns_1@172.17.0.2:<0.524.0>:prometheus:post_async:188]Prometheus http request failed:
URL: http://127.0.0.1:9123/api/v1/query
Body: query=%7Bcategory%3D%60system-processes%60%2Cname%3D~%60sysproc_mem_resident%7Csysproc_mem_size%7Csysproc_cpu_utilization%7Csysproc_major_faults_raw%60%7D&timeout=5s
Reason: <<"Service Unavailable">>
[ns_server:error,2022-03-16T10:38:05.087Z,ns_1@172.17.0.2:<0.533.0>:prometheus:post_async:188]Prometheus http request failed:
URL: http://127.0.0.1:9123/api/v1/query
Body: query=%7Bcategory%3D%60system%60%2Cname%3D~%60sys_cpu_utilization_rate%7Csys_cpu_stolen_rate%7Csys_swap_total%7Csys_swap_used%7Csys_mem_total%7Csys_mem_free%7Csys_mem_limit%7Csys_cpu_cores_available%7Csys_allocstall%60%7D&timeout=5s
Reason: <<"Service Unavailable">>
[ns_server:error,2022-03-16T10:38:05.151Z,ns_1@172.17.0.2:<0.553.0>:prometheus:post_async:188]Prometheus http request failed:
URL: http://127.0.0.1:9123/api/v1/query
Body: query=cm_failover_safeness_level%7Bbucket%3D%60******************%60%7D%5B20s%5D&timeout=5s
Reason: <<"Service Unavailable">>
[ns_server:error,2022-03-16T10:38:05.171Z,ns_1@172.17.0.2:<0.556.0>:prometheus:post_async:188]Prometheus http request failed:
URL: http://127.0.0.1:9123/api/v1/query
Body: query=cm_failover_safeness_level%7Bbucket%3D%60contentful-testing%60%7D%5B20s%5D&timeout=5s
Reason: <<"Service Unavailable">>
[ns_server:error,2022-03-16T10:38:05.245Z,ns_1@172.17.0.2:service_status_keeper_worker<0.579.0>:rest_utils:get_json:57]Request to (indexer) getIndexStatus with headers [] failed: {error,
                                                             {econnrefused,
                                                              [{lhttpc_client,
                                                                send_request,
                                                                1,
                                                                [{file,
                                                                  "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
                                                                 {line,220}]},
                                                               {lhttpc_client,
                                                                execute,9,
                                                                [{file,
                                                                  "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
                                                                 {line,169}]},
                                                               {lhttpc_client,
                                                                request,9,
                                                                [{file,
                                                                  "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/lhttpc/lhttpc_client.erl"},
                                                                 {line,
                                                                  93}]}]}}
[ns_server:error,2022-03-16T10:38:05.246Z,ns_1@172.17.0.2:service_status_keeper-index<0.580.0>:service_status_keeper:handle_cast:103]Service service_index returned incorrect status

Thanks, Jakub

Hello,

Would you be able to share your query.log? There must be a reason the query service process is still running yet no longer accepting requests, and I’d hope there’d be something logged to indicate why.
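On a default Linux install it should be in the standard Couchbase log directory, e.g. something like (the path may differ on your deployment):

tail -n 500 /opt/couchbase/var/lib/couchbase/logs/query.log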

I presume it isn’t directly accessible when in this state? e.g. curl -su uid:pwd http://localhost:8093/admin/vitals
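If that curl is also refused or hangs, a quick check of whether anything is still listening on the query port, and whether the cbq-engine process is alive at all, would help narrow it down, e.g.:

ss -ltnp | grep 8093
ps -ef | grep cbq-engine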