Hello,
We work in a large European bank and we are starting to make tests with Couchbase in order to eventually migrate some of our SQL databases. We have initially inserted 5 million documents that look like this:
{
“fileName”: “au_fut_floor.dico”,
“uniqueID”: “000002”,
“mnemo”: “AA”,
“instrumentType”: “F”,
“strike”: 0,
“expiry”: 1464732000000,
“version”: “|”,
“feedType”: “10”,
“feedCode”: “SFE-AAM6”,
“feedName”: “SFE”,
“orderPassingCode”: “SFE-AAM6”,
“quotity”: 1,
“tickSizeRule”: “sfe-10-1Fut”,
“minQuantity”: 1,
“maxQuantity”: 0,
“additionalData”: “|”,
“serieCentraleFlag”: “-1”,
“lotSize”: 1,
“currency”: “AUD”,
“isinCode”: “|”,
“underlyingName”: “AA”,
“segmentName”: “|”,
“sectorName”: “|”,
“updateDate”: 1451499570578
}
Please note the “currency” field that always consists of a 3 character string.
Yesterday we executed what we thought was a simple N1QL Query using CBQ. Basically, we wanted to count the number of instruments that have currency = “EUR” :
SELECT COUNT(*) FROM dico WHERE currency="EUR"
Here are the query results along with performance numbers:
As you can see, it took 19 minutes to execute!! Now, please do keep in mind there is no index on this field, so a full scan was performed. Furthermore, here are the server hardware details:
OS: Windows(64bit) Server 2012
Processor: Intel Xeon E5-2690 v3 @ 2.60 GHz( 4 processors)
RAM: 32 GB
So my question is the following. Once we hit an important number of documents, let’s say > 1 million, are we supposed to have an index on every field we are going to search? I understand that doing a full document scan from disk is expensive, but 19 minutes? For comparison purposes, we ran the same query(count number of instruments with currency = “EUR”) on a MS SQL Server table that has very similar fields and 14 million records with no indices, it executed in 18 seconds. We later created the index on the currency field with Couchbase and the query executed in 16 seconds.
Evidently, going from 19m to 16s is quite a gain, but again, compared to the performance on MS SQL without index, far away from being impressive. I want to continue pushing a document technology here in our department because it adds a lot of flexibility to our applications and avoids tables with 110 columns for example. It could be the case that the speed of this type of queries will always be better with SQL: searching one column will always be faster than it’s NoSQL counterpart. But it’s a tradeoff against having documents with data from 5 different tables, therefore avoiding expensive JOIN operations ?
Please help me understand this performance and if you have any comments regarding the benefits-costs of replacing SQL with a document database(Couchbase) pertinent to the example(above) please don’t hesitate.