OK, I’m really excited about the possibilities of using Couchbase 4 and N1QL, so I thought I would first get a feel for how the failover / redundancy technology handles N1QL indexes, and how things appear from the client side. I’m encountering a problem right off the bat, and it seems so simple that I feel I must be doing something wrong, so please, somebody slap me and tell me what I’ve messed up.
I installed a 2-node cluster of 4.0 Community Edition RC0 (build 4047). During the initialization of each node, I told it I wanted all the service checkboxes turned on, so all components should be running on both nodes. I created a Couchbase bucket, then did a CREATE PRIMARY INDEX USING GSI. (This problem happens whether I give the primary index a name or leave it unnamed so it defaults to “#primary”.) I can then do a few UPSERTs and some very nice SELECTs; it’s nice and fast, and I can run N1QL queries from either of the nodes and get the same results. All lovely.
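For reference, the statements I’m running are roughly along these lines (the document key, fields, and values here are just made-up placeholders, but the bucket really is called throttlen1ql):

    CREATE PRIMARY INDEX ON throttlen1ql USING GSI;
    UPSERT INTO throttlen1ql (KEY, VALUE) VALUES ("doc::1", {"type": "test", "counter": 1});
    SELECT * FROM throttlen1ql WHERE type = "test";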
But if I do a controlled Graceful Failover of either node, and keep querying over and over as the progress bar fills up, it works all the way until the progress bar reaches 100% and disappears; then, instantly, it’s as if the primary index is missing, or deleted, or was not copied over from the vBuckets the way it was supposed to be during the failover, or something. I get
"code": 4000,
"msg": "No primary index on keyspace throttlen1ql. Use CREATE PRIMARY INDEX to create one."
in my JSON result, so of course all N1QL queries fail. If I do a Delta Recovery of that node back into the cluster, the index remains gone. If I recreate the index, then all my queries start working again, and it has not lost the actual data (the JSON documents themselves in the bucket). If I recreate the index while just one node is in the cluster, and then bring the other node back in, it continues to work fine. So it seems to be just something that happens right at the end of the graceful fail-out process. I also don’t think it happens every time, but I’ve failed out one node, then the other, and it has happened at least once for each of the two nodes, so I can’t believe it’s something corrupted on just one node.
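For what it’s worth, “recreating the index” is just re-running the same statement, and (in case it helps anyone reproduce this) I can check what the query service thinks exists through system:indexes, something like:

    CREATE PRIMARY INDEX ON throttlen1ql USING GSI;
    SELECT name, keyspace_id, state FROM system:indexes WHERE keyspace_id = "throttlen1ql";

After re-running the CREATE, the SELECTs on the bucket start returning results again, so the data itself clearly survived the failover.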
Any ideas? I did formerly have the Beta version of 4.0 (which I believe was 4.0 Enterprise Beta) on these two servers, but I carefully did a “dpkg --remove” of it and zapped the contents of /opt/couchbase before “dpkg --install”ing the release version. So I don’t think it’s any ghost data hanging around from before, but even if it is, this behavior is certainly unexpected and unwelcome. If anyone wants any logs, I can excerpt or upload anything that would help. Just looking over the “Logs” tab in the GUI, I don’t see anything I would call strange. Thanks, all!!
– Jeff Saxe
SNL Financial
Charlottesville, Virginia