Looking for a highly efficient semi-structured document management system

Hi everyone,

We’re searching for a solution to efficiently manage large volumes of XML documents in an on-premises environment, and we’re wondering whether the Couchbase engine could make our lives easier here.

• Documents are 10-30 KB each, with rare exceptions up to 2-3 MB.
• Ingestion rates: several million XML documents per hour at peak, hundreds of thousands per hour under normal load.
• Read operations: up to tens of millions of document retrievals per hour, with filtering on about a dozen attributes. Queries return either single documents or batches matching specific criteria. We don't need to understand the whole XML here — we just need to index a number of header attributes that are used for searching. In addition, we need the ability to retrieve large batches of documents (e.g., hourly/daily chunks) for data structuring and analysis (data export to another system).
• Performance: Write and read latency for a single document should be <50 ms. The system must scale (horizontally) to handle peak loads (millions of writes and reads per hour).
• Retention of several years, divided into "hot" data (last 2-3 months) and "cold" data (older, rarely accessed).
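To make the "index only the header attributes" requirement concrete, here's a minimal sketch of extracting searchable metadata from an incoming XML document using Python's standard library. The element names (`header`, `docId`, etc.) are hypothetical placeholders — your actual XML schema will differ:

```python
from xml.etree import ElementTree as ET

# Hypothetical header attributes used for searching; adjust to your schema.
HEADER_ATTRS = ["docId", "createdAt", "source", "docType"]

def extract_metadata(xml_bytes: bytes) -> dict:
    """Parse only the header block and return the attributes worth indexing.

    The full document body is stored as an opaque blob; only this small
    dict needs to be indexed by the database.
    """
    root = ET.fromstring(xml_bytes)
    header = root.find("header")
    return {name: header.findtext(name) for name in HEADER_ATTRS}
```

The point of the split is that the index stays small (a dozen fields per document) even though the stored payload is 10–30 KB.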

Current considerations and initial tests:
• Metadata (only the header attributes used for document searching) stored in an RDBMS for fast lookups/searches
• Documents physically stored in an object store (e.g., S3).
• Challenges: managing millions of hot document writes/reads to S3 in a short timeframe and performing large-scale reads for analytics.
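A rough sketch of the two-store write path we've been testing, with an in-memory dict standing in for S3 and SQLite standing in for the RDBMS (both are stand-ins for illustration only — the date-prefixed key scheme is an assumption that makes the hourly/daily export chunks easy to enumerate):

```python
import sqlite3

object_store = {}  # stand-in for S3

db = sqlite3.connect(":memory:")  # stand-in for the metadata RDBMS
db.execute(
    "CREATE TABLE docs (doc_id TEXT PRIMARY KEY, doc_type TEXT,"
    " created_at TEXT, s3_key TEXT)"
)

def store_document(doc_id: str, doc_type: str, created_at: str,
                   xml_bytes: bytes) -> str:
    # Date-prefixed key so hourly/daily exports can list one prefix.
    key = f"hot/{created_at[:10]}/{doc_id}.xml"
    object_store[key] = xml_bytes                      # payload to object store
    db.execute("INSERT INTO docs VALUES (?, ?, ?, ?)", # metadata row for lookups
               (doc_id, doc_type, created_at, key))
    return key

def fetch_by_type(doc_type: str) -> list:
    # Lookup goes through the metadata index, then pulls payloads by key.
    rows = db.execute("SELECT s3_key FROM docs WHERE doc_type = ?",
                      (doc_type,)).fetchall()
    return [object_store[k] for (k,) in rows]
```

The challenge we hit is exactly the write side of this path: millions of small PUTs per hour against S3, plus keeping the two stores consistent.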
Preference/Idea:
• A database engine capable of efficiently handling both metadata and “hot” documents, scalable across nodes. Cold data could go to S3 for long-term storage (e.g., in an aggregated format), leaving the corresponding metadata in the DB; built-in support for managing these exports would be a big advantage.
• Would Couchbase be a good fit for this use case? We’re evaluating technologies for this project and would appreciate your input. Could Couchbase effectively handle the described workload, and which features or functionalities would be most relevant in this scenario? If there’s a similar success story or example, that would be very helpful.

Hi @PitErr

So if my back-of-a-napkin maths is correct, you’re looking for roughly:

  • Peak ingestion of ~1k docs/sec.
  • Peak KV reads of ~25k docs/sec.
  • Read and write latencies <50ms.
  • Ingestion of perhaps 1TB of data a day.
  • Retention of 2-3 months (so perhaps 60-90TB) of data, after which it gets moved to cold storage.
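For anyone checking the napkin: the figures above can be reproduced with rough arithmetic. The per-document size and the exact hourly rates below are assumptions picked from the middle of the ranges in the original post:

```python
# Assumed figures from the original post's ranges.
peak_docs_per_hour = 3_600_000       # "several million per hour"
peak_reads_per_hour = 90_000_000     # "tens of millions per hour"
sustained_docs_per_hour = 2_000_000  # assumed sustained average
avg_doc_bytes = 20 * 1024            # mid-range of 10-30 KB

peak_ingest = peak_docs_per_hour / 3600     # ~1,000 docs/sec
peak_reads = peak_reads_per_hour / 3600     # ~25,000 docs/sec
daily_tb = sustained_docs_per_hour * 24 * avg_doc_bytes / 1e12  # ~1 TB/day
retention_tb = daily_tb * 90                # ~90 days of hot data
```

With these inputs the totals land right around 1k writes/sec, 25k reads/sec, ~1 TB/day, and ~90 TB of hot retention.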

If that’s all correct, then yes I believe Couchbase would be a great fit for you. I’d suggest contacting our sales team directly, as they have a lot of experience with sizing clusters for requirements, and may be able to give some info about similar deployments.