N1QL query giving inconsistent results

Hi all, We are seeing a very strange behavior in Couchbase while using the N1ql query.
Couchbase version: 6.0.2, Java client version: 2.5.9
We have Couchbase cluster with 3 nodes. 3 different buckets are present to hold different types of data. Issue is found only in one of the buckets. This bucket is having approximately 3 million records. A small java utility uses simple N1ql query to fetch that data. Below is the query.

SELECT productId FROM MyBucket USE KEYS ‘productName’;

No index is present on the bucket. This query is fetching either wrong result or Blank response most of the time it is executed even when proper data is present in the bucket. Tried executing the same query on Couchbase UI - Query Editor of all 3 nodes present in the cluster. Surprisingly 2 nodes always gave the proper result and one node gave different/inconsistent results every time we ran it.

More strange is that it is occurring only in Production environment and not in any lower environments. The same java utility was deployed in production 7 months back . First 6 months it worked very well with no issues and all of a sudden now this issue has come up.

Since the issue was in production and we could not find root cause, we tried restarting that particular node which had issue. Soon after restart the issue went off . Again it worked well for 20 days and now we are seeing same behavior, but this time the issue is in different node.

Could anyone please help us with this issue ?

6.0.2 is a little on the old side so if there is a bug, it likely has been addressed (it is end-of-life this month). Can you update to a more current version?

That said, this seems to me to more likely be a KV issue than a Query service issue, especially since the other data nodes are fine and the problem migrates. When the problem is occurring, do you get the same if you don’t use a query and just get the relevant documents by key directly? (You could check using the cbc tool or write a small Java test programme.)

I have not tried to fetch that key from cbc tool or java program, Instead I have tried searching for the document with that Id directly from the document search tab and I could see proper data inside that document.
This is happening for almost all the documents and not just with one or two.

Also, it is not possible to upgrade the Couchbase version for now. We will definitely consider this during our next release.

Any further suggestions will be very helpful to us.

I presume you have specific key values you’re using to test? If you SELECT * FROM ... USE KEYS... (i.e. * not a field list) do you get any content at all?

Do you have the query service on all 3 nodes? If multiple instances of the query service, does the combination of query node + data node-that-holds-the-document make a difference? (i.e. if you query from a “good” node for a document on a “bad” node, does that work any differently to issuing the same query from the “bad” node itself?) (I expect the answer is that you do have 3 instances of the Query service and it matters only that the Query node is the “bad” one, just confirming.)

Do you have Indexing configured at all ? (I know you haven’t got any indexes defined currently for this bucket.) – Could you create and compare the USE KEYS against a primary index? (WHERE meta().id = xxx) (This is just in case there is something up with the USE KEYS itself, however unlikely.)

Is there any indication of a failure or restart (other than your deliberate ones of course) in the query.log or memcached.log files ?

If this is your actual query, then just use the kv api.
If your query is something else, can you show the actual query?

If @dh suggestion did not work and working on other query nodes. Last option is restart query service node (simple as kill cbq-engine process, will restart automatically) and see if that helps

I have tried running this for many key values. At first I could see wrong result and after sometime it mostly was not giving any content at all. Result used to be like { } . And yes, we have 3 instances of query service and that query node is the bad one.
We could not find any such error in any of the log files.
I will try creating primary index next time the issue occurs. Any Idea on the root cause?

Yes, this is the actual query. As a workaround I will try using KV API. But any suggestions to find the root cause will help us a lot.

Also, we have another environment(DR) which is identical to prod. we have same couchbase setup in DR and have XCDR between DR and prod. I could see the same issue in DR as well. I had written a simple java program just to see what percentage of wrong data is retrieved from query. I used the same query here. During first run, I tried fetching 10k records and we got 500 blank responses({}) . then tried fetching the results for all 3 M records, but in between got some timeout exception(It was just a simple test code with no such exception handling). As I had not captured the results for 10k records during the first run, I ran it again for 10 k records. But this time strangely I did not see any blank response or wrong response. I tried running the query on UI and even there I could always see the proper result being fetched. Its very strange that how this issue got resolved on its own.Its been 7 days and no such issue is seen in DR as well.
Any suggestions on why would we see such behavior?

It is likely addressed by one of the many fixes that have gone into the product since 6.0.2; I would still recommend moving to a current version. Upgrading short-cuts any investigation eliminating a host of issues speeding the process greatly. Furthermore, I can’t see re-discovery of an issue being of great benefit as it’ll not be addressed in 6.0 in any event (if it hasn’t already been, in a later 6.0 version).

On the technical side, it seems likely that the Query connection to the KV is somehow not returning an error status and is being reused instead of discarded. Meaning that Query isn’t actually fetching anything but isn’t seeing it as a failure either. If a connection that hasn’t entered this problematic state is used, it retrieves the document as it should. Hence the inconsistency. Restarting means all new connections - hence all work again.

There have been lots of revisions to the Query-KV connections since 6.0, especially in 7.x.