Looking for a highly efficient semi-structured document management system

Hi everyone,

We’re searching for a solution to efficiently manage large volumes of XML documents in an on-premises environment, and we’re wondering whether the Couchbase engine could make our lives easier here.

• Documents are 10-30 KB each, with rare exceptions up to 2-3 MB.
• Ingestion rates: several million XML documents per hour at peak, hundreds of thousands per hour under normal load.
• Read operations: up to tens of millions of document retrievals per hour, with filtering on about a dozen attributes. Queries return either single documents or batches matching specific criteria. We don't need to understand the whole XML here — we just need to index a number of header attributes that are used for searching. In addition, we need the ability to retrieve large batches of documents (e.g., hourly/daily chunks) for data structuring and analysis (data export to another system).
• Performance: Write and read latency for a single document should be <50 ms. The system must scale (horizontally) to handle peak loads (millions of writes and reads per hour).
• Retention of several years, divided into "hot" data (last 2-3 months) and "cold" data (older, rarely accessed).
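To make the "index only the header attributes" requirement concrete, here's a minimal sketch of extracting searchable metadata from an incoming XML document using Python's standard library. The element names (`header`, `docId`, etc.) are hypothetical placeholders — your actual XML schema will differ:

```python
from xml.etree import ElementTree as ET

# Hypothetical header attributes used for searching; adjust to your schema.
HEADER_ATTRS = ["docId", "createdAt", "source", "docType"]

def extract_metadata(xml_bytes: bytes) -> dict:
    """Parse only the header block and return the attributes worth indexing.

    The full document body is stored as an opaque blob; only this small
    dict needs to be indexed by the database.
    """
    root = ET.fromstring(xml_bytes)
    header = root.find("header")
    return {name: header.findtext(name) for name in HEADER_ATTRS}
```

The point of the split is that the index stays small (a dozen fields per document) even though the stored payload is 10–30 KB.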

Current considerations and initial tests:
• Metadata (only the header attributes used for document searching) stored in an RDBMS for fast lookups/searches
• Documents physically stored in an object store (e.g., S3).
• Challenges: managing millions of hot document writes/reads to S3 in a short timeframe and performing large-scale reads for analytics.
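A rough sketch of the two-store write path we've been testing, with an in-memory dict standing in for S3 and SQLite standing in for the RDBMS (both are stand-ins for illustration only — the date-prefixed key scheme is an assumption that makes the hourly/daily export chunks easy to enumerate):

```python
import sqlite3

object_store = {}  # stand-in for S3

db = sqlite3.connect(":memory:")  # stand-in for the metadata RDBMS
db.execute(
    "CREATE TABLE docs (doc_id TEXT PRIMARY KEY, doc_type TEXT,"
    " created_at TEXT, s3_key TEXT)"
)

def store_document(doc_id: str, doc_type: str, created_at: str,
                   xml_bytes: bytes) -> str:
    # Date-prefixed key so hourly/daily exports can list one prefix.
    key = f"hot/{created_at[:10]}/{doc_id}.xml"
    object_store[key] = xml_bytes                      # payload to object store
    db.execute("INSERT INTO docs VALUES (?, ?, ?, ?)", # metadata row for lookups
               (doc_id, doc_type, created_at, key))
    return key

def fetch_by_type(doc_type: str) -> list:
    # Lookup goes through the metadata index, then pulls payloads by key.
    rows = db.execute("SELECT s3_key FROM docs WHERE doc_type = ?",
                      (doc_type,)).fetchall()
    return [object_store[k] for (k,) in rows]
```

The challenge we hit is exactly the write side of this path: millions of small PUTs per hour against S3, plus keeping the two stores consistent.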
Preference/Idea:
• A database engine capable of efficiently handling both metadata and “hot” documents, scalable across nodes. Cold data could go to S3 for long-term storage (e.g., in an aggregated format), leaving the corresponding metadata in the DB; built-in support for managing these exports would be a big advantage.
• Would Couchbase be a good fit for this use case? We’re evaluating technologies for this project and would appreciate your input. Could Couchbase effectively handle the described workload, and which features or functionalities would be most relevant in this scenario? If there’s a similar success story or example, that would be very helpful.

Hi @PitErr

So if my back-of-a-napkin maths is correct, you’re looking for roughly:

  • Peak ingestion of ~1k docs/sec.
  • Peak KV reads of ~25k docs/sec.
  • Read and write latencies <50ms.
  • Ingestion of perhaps 1TB of data a day.
  • Retention of 2-3 months (so perhaps 60-90TB) of data, after which it gets moved to cold storage.
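For anyone checking the napkin: the figures above can be reproduced with rough arithmetic. The per-document size and the exact hourly rates below are assumptions picked from the middle of the ranges in the original post:

```python
# Assumed figures from the original post's ranges.
peak_docs_per_hour = 3_600_000       # "several million per hour"
peak_reads_per_hour = 90_000_000     # "tens of millions per hour"
sustained_docs_per_hour = 2_000_000  # assumed sustained average
avg_doc_bytes = 20 * 1024            # mid-range of 10-30 KB

peak_ingest = peak_docs_per_hour / 3600     # ~1,000 docs/sec
peak_reads = peak_reads_per_hour / 3600     # ~25,000 docs/sec
daily_tb = sustained_docs_per_hour * 24 * avg_doc_bytes / 1e12  # ~1 TB/day
retention_tb = daily_tb * 90                # ~90 days of hot data
```

With these inputs the totals land right around 1k writes/sec, 25k reads/sec, ~1 TB/day, and ~90 TB of hot retention.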

If that’s all correct, then yes I believe Couchbase would be a great fit for you. I’d suggest contacting our sales team directly, as they have a lot of experience with sizing clusters for requirements, and may be able to give some info about similar deployments.