Using collections to organize data has many benefits including:
Ability to refer to similar documents as a unit for various purposes such as building an index, setting up replication, querying, backup/restore etc.
Does it also provide the ability to configure hardware resources at the collection level, or at the scope level? I mean configuring the working-set percentage at the scope/collection level, so that an admin can allocate a predefined hardware resource (RAM) to a particular scope or collection.
It’s good to know that up to 30 buckets would be supported in Couchbase 6.5.
In general a bucket is still the unit of physical management (RAM Quota, number of replicas, eviction policy, disk paths etc) - a scope / collection is a second (and third) level logical management concept.
Well, it depends on how much separation you want between tenants - I probably should have said “buckets are the unit of resource management”, not physical management.
Data in different collections / scopes has its own namespace, so documents are independent. Data within buckets is still on the same physical machine, running in the same process, so it’s not like different buckets are “physically” separate in the hardware sense.
@drigby, with the introduction of scopes/collections, resource management has been taken to a much more granular level than buckets alone: RBAC and indexes, for example, are now applicable at the scope/collection level as well.
So I just want to understand the difficulty of moving other resource management (mostly at the hardware level) to the scope/collection level.
Part of the point of collections is they are a lighter-weight abstraction than buckets - they don’t need their own dedicated disk files, individual memory quotas, individual background tasks, etc. That’s how you can have ~1000s of collections without having 1000x of the above resources to manage.
What management layers are available at the scope and collection level?
Can I restrict access for some users depending on the scope? On the collection?
Can I set up XDCR for a specific scope only? For a specific collection only?
At the moment, collections are a developer preview feature. We have some thoughts and input on what people would like to see with access controls at the scope/collection level. It’s still under development though.
Can I turn that question back on you? What would you like to see @Benny? What can you tell us about the use case?
What we’ve heard thus far is that people may have many different instances of a microservice, each with different data and configuration. The scope addresses this nicely. Then, for example, the resources of a bucket can be shared with lots of different tenants, each with their own scope. That allows, as an example, multiple versions of that microservice to each have a “users” collection. Clearly there’d always be an access control requirement, but in getting to high density and high scale, maybe you’d allow for shared resources… and thus allow noisy neighbors… with ability to observe and adjust later.
Question about GSI-Indexes on Collection/Scope Level:
What about Consistency ( specifically REQUEST_PLUS)?
Example
I do a large bulk import for Tenant A (e.g. 1 million products).
At the same time I run a N1QL query (consistency = request_plus) for Tenant B which uses a GSI index (which is also affected by Tenant A).
At the moment (CB 6.0) Tenant B is affected by Tenant A: the request_plus query needs to wait for the indexes to be updated by the large import of Tenant A.
Question:
In the new world, will Tenant B still be affected by the large import of Tenant A when A and B live in separate collections / scopes inside the same bucket?
How do indexes need to be created? Do I need to create the same index per Collection/scope or are indexes automatically handled per scope when defined globally just once?
In other words: What we are looking for is a separation of Tenant A and Tenant B, which live in the same bucket, but different Collections/Scopes. I would like to understand how this affects indexes and consistency.
Thank you,
Here is what I would like to have from scopes/collections. Today, in a highly concurrent multi-tenant system, I can “isolate” my data by naming keys consistently: every key starts with the scope and collection, so I can have the same user in multiple services across multiple scopes.
My problem is that all the management tools work at the bucket level: memory allocation, XDCR, views and indexes.
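The key-naming approach described here can be sketched roughly as follows. This is only an illustration of the idea: the `::` separator and the `make_key` / `parse_key` helper names are assumptions, not something stated in the post.

```python
from typing import Tuple

# Assumed delimiter between the scope, collection and document id parts.
SEPARATOR = "::"


def make_key(scope: str, collection: str, doc_id: str) -> str:
    """Build a document key that always starts with the scope and collection."""
    for part in (scope, collection):
        # The prefix parts must be non-empty and must not contain the
        # separator, otherwise the key could not be parsed back reliably.
        if not part or SEPARATOR in part:
            raise ValueError(f"invalid key part: {part!r}")
    return SEPARATOR.join((scope, collection, doc_id))


def parse_key(key: str) -> Tuple[str, str, str]:
    """Split a key back into (scope, collection, doc_id)."""
    # maxsplit=2 keeps any separator inside the document id intact.
    scope, collection, doc_id = key.split(SEPARATOR, 2)
    return scope, collection, doc_id


# The same user id can live in several services without colliding.
key_a = make_key("service_a", "users", "alice")
key_b = make_key("service_b", "users", "alice")
```

With this convention the isolation is purely a naming discipline in the application: the bucket itself still sees one flat key space, which is why bucket-level tools cannot act on a single "scope".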
My biggest need would be XDCR for a specific scope only, and maybe for a specific collection.
After that, the ability to configure an index only for specific scopes and specific collections.
Permissions restricted to a single scope are not that useful for me today.
The dream would be to configure nodes to have affinity with particular buckets and scopes.
The plan is that in 7.0 you will be able to control access using role assignments at the collection level. Indexes will also be available at the collection level. There will not be scope- or bucket-level indexes. You will be able to set up XDCR per collection.
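As a rough sketch of what "one index per collection" could mean for a multi-tenant layout, the helper below generates one `CREATE INDEX` statement per tenant scope. The bucket, scope, collection, field and index names are all hypothetical. The three-part `bucket.scope.collection` keyspace form is the syntax that later shipped with Couchbase Server 7.0; this post describes only the plan, so treat the exact statement text as an assumption.

```python
from typing import List


def keyspace(bucket: str, scope: str, collection: str) -> str:
    """Return a backtick-quoted bucket.scope.collection keyspace path."""
    return ".".join(f"`{part}`" for part in (bucket, scope, collection))


def create_index_statements(
    bucket: str, scopes: List[str], collection: str, field: str
) -> List[str]:
    """Generate one CREATE INDEX statement per scope.

    Because indexes are defined per collection, the same logical index
    has to be declared once for every tenant's collection.
    """
    index_name = f"idx_{collection}_{field}"  # hypothetical naming scheme
    return [
        f"CREATE INDEX `{index_name}` "
        f"ON {keyspace(bucket, scope, collection)}(`{field}`)"
        for scope in scopes
    ]


statements = create_index_statements(
    "app", ["tenant_a", "tenant_b"], "users", "email"
)
for statement in statements:
    print(statement)
```

The statements are only built as strings here; running them against a cluster would go through whichever SDK or query tool is already in use.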
Resource allocation will remain at the bucket level. Scopes and collections have soft boundaries, not hard ones; you will definitely hear the neighbors through the walls.
Views are a question mark for now. They are distinctly old and deprecated technology. It’s possible we will only support them at the bucket level, which translates to the default collection of the bucket in a collection-based world. Generally speaking we encourage users to move from views to either N1QL or Analytics.
Hi Benny,
Yes, the plan is to support XDCR replication setup at scope level or at individual collection level. Setting it up at scope level would automatically include all the collections in the scope. We are still designing the feature and working through details such as what should happen when collections are added to the scope or removed from the scope.
With collections, in your example Tenant B will not be affected by the large import of Tenant A. This is because now the index will wait for the high-seqno of the collection instead of the high-seqno of the whole bucket.
Note that request_plus uses high-seqno behind the scenes.
Note, to add to Shivani’s response: if concurrent updates hit the Tenant A and Tenant B collections, with A receiving a bulk update and B a single update, the Tenant B request_plus query could still be affected. Whether it is depends on the total ordering of the concurrent updates and the query execution. This occurs because collections are not fully independent of each other.
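The difference between waiting on the bucket's high-seqno and on a collection's high-seqno can be illustrated with a toy model. This is a simplified sketch, not how the indexer is implemented: it assumes a single ordered mutation stream for the whole bucket (the real system tracks sequence numbers per vBucket), and the `Mutation`, `build_log` and `must_wait` names are made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Mutation:
    seqno: int       # position in the bucket-wide ordering
    collection: str  # collection the mutated document belongs to


def build_log(order: List[str]) -> List[Mutation]:
    """Assign increasing seqnos to mutations in the given arrival order."""
    return [Mutation(seqno=i, collection=c) for i, c in enumerate(order, start=1)]


def bucket_high_seqno(log: List[Mutation]) -> int:
    """Highest seqno of any mutation in the bucket."""
    return max((m.seqno for m in log), default=0)


def collection_high_seqno(log: List[Mutation], collection: str) -> int:
    """Highest seqno among mutations of one collection only."""
    return max((m.seqno for m in log if m.collection == collection), default=0)


def must_wait(indexed_up_to: int, target_seqno: int) -> bool:
    """A request_plus query waits until the index has caught up to the target."""
    return indexed_up_to < target_seqno


# Case 1: Tenant B writes one document, then Tenant A bulk-imports 1000.
log = build_log(["B"] + ["A"] * 1000)
indexed_up_to = 1  # the indexer has processed B's write, not A's import yet
waits_on_bucket = must_wait(indexed_up_to, bucket_high_seqno(log))
waits_on_collection = must_wait(indexed_up_to, collection_high_seqno(log, "B"))

# Case 2: the single B write lands after A's bulk import in the ordering.
late_log = build_log(["A"] * 1000 + ["B"])
waits_when_b_is_last = must_wait(indexed_up_to, collection_high_seqno(late_log, "B"))
```

In this model, case 1 shows the Tenant B query waiting only under the bucket-wide target, while case 2 shows the ordering effect: B's own high-seqno sits behind A's import, so the B query still waits.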
Yes, you will be able to get stats on collections, though not as comprehensive as for buckets. We are still trying to finalize the list of stats that are most useful at the collection level. What do you need in addition to data size in bytes?
We will also have number of gets/sets per collection, number of documents per collection, memory resident size per collection.
Noted your requirement for getting these via N1QL (@keshav_m) and also getting them at the scope level.