Multiple different indexes per tenant - How?

Hi,

In our Cloud/SaaS/Multi-Tenant solution each customer can create multiple datastores.

  • A datastore is something like a table where you can put records with rows and columns (like a CSV or a Google Spreadsheet).
  • each record has a unique identifier within the datastore
  • each row is stored as a JSON-payload (thus we think Couchbase is a good fit)
  • the customer can query each datastore to build custom reports etc. (thus we think N1QL is pretty awesome)

What we want to do

  • the customer should be able to choose the fields which should be indexed
  • the reason is that each datastore has a different purpose so different “questions” will be asked, hence it makes sense to index different fields

Question

  • is it possible (or planned for the future) to add indexes on the fly which are bound only to a sub-set of the data? e.g. based on a datastore-id / tenant-id?
  • our goal is that each index is only populated / touched for the specific datastore/tenant it belongs to
  • as we understand it, if we use a global GS1 index for this, then this index applies for the whole bucket. But we need it indexes which apply only for a certain portion of documents (in other words: an insert/update of a doc of datastore-A should not affect the indexes of datastore-B)

Background

We are currently evaluating Couchbase because we are searching exactly a functionality like this. We also thought about maybe having an ElasticSearch-index per datastore/tenant but it would be cool if this would be possible in Couchbase directly, as there are other cool features.

1 Like

Hi @synesty, Thanks for the kind words on Couchbase & N1QL.

on the questions:

  1. Yes. You can create index on subset of data based on expression.
    CREATE INDEX i1 ON info(x,y,z) WHERE customer = “joe” and tenant = “2983” ;

  2. Your assumption is correct.

  3. They’re called “global” in the sense that they can index data from ANY of the data nodes. But the index themselves will contain the subset of the data specified by the WHERE condition of the CREATE INDEX.

Couchbase indexing is quite flexible. Please take a look at this article:
https://dzone.com/articles/index-first-and-query-faster
and the recent talks at Couchbase Connect at http://www.couchbase.com

1 Like

Hi @keshav_m,
thanks, this is great news. “Partial index” is the magic term I was looking for.

Now we will investigate how this will scale, as the number of indexes can become quite large - basically unbound.

sure… you can use partial indexes, which is basically an index created on subset of documents in a bucket. See CREATE INDEX syntax… You can provide filters with where-clause.

You will scale the number of indexes by scaling the number of index nodes. You can designate nodes as index nodes. We call it multidimensional scaling. You can scale data, index, and query nodes independently.

1 Like

Do you see any problem in the approach of CREATING INDEX on the fly when an index will be needed (for a analytics-like query “Top 10 best-selling products”) and DROP the index afterwards? The idea is to keep the number of indexes to a minimum (to not waste precious index memory). The indexes are only needed short-term for some queries which e.g. run hourly or nightly.

The use case:
Assume a Dashboard with several graphs (e.g. a pie chart with TOP-N products, TOP-N cities etc.). Each graph could be the result of a N1QL query. The number of different graphs/queries can be very high, thus the amount of needed indexes is high too. But the indexes are not needed all the time. We can fetch and persist the query-results and then the indexes are not needed anymore e.g. until the night, when the Dashboards are created.
The amount documents is always between 100k and a few 1mio items. So N1QL queries are only fast if they hit a covering index. We thought about MapReduce-Views too, but like in this topic Top N over aggregated (MapReduced) we cannot sort by the value (_count or _sum).

It seems that this “Getting the TOP-N documents grouped by some attribute” is a pretty hard use case for CB if you are dealing with a large number of documents and a large number of different JSON-structures (which requires lots of different indexes).

Any ideas would be appreciated.

CB 5.0.0 beta2

Creating the index dynamically is fine. You’d need to size the nodes to handle the additional indexes in terms of memory, storage and cpu. During the index creation, projector will also have additional work. Presumably, you’re creating partial index on documents for a given time/other range.

/cc @deepkaran.salooja @john

Ok thanks. I think we have to try it out and gather some metrics.

Hey @synesty,

We have a case where the requirements are similar with your use case. Have you had any chance to measure the performance impact?

No unfortunatelly not. We are in the still in the process of introducing CB (5.0) and we currently go with a fixed set of GSI indexes, which is the same for all customers. Not sure yet what our strategy for customer-specific indexes will be. Maybe we solve that with a different technology (e.g. ElasticSearch or Couchbase’s Fulltext Search). But that is in the future, once we know that we will continue building on CB. In our scenario in step 1 we are replacing MySQL with CB for one really specific part (not CB for the whole application). If that works well we will check out how to do more things.

1 Like