We are excited to announce the General Availability (GA) of Python support for the Couchbase Spark Connector, bringing first-class integration between Couchbase Server and Apache Spark to Python data engineers. This GA release means the connector is production-ready and fully supported, enabling PySpark applications to seamlessly read from and write to Couchbase. By combining Couchbase’s high-performance NoSQL database (with its SQL++ query language) and Spark’s distributed processing engine, data engineers can now easily build fast, scalable data pipelines and analytics workflows. In short, the Couchbase Spark Connector for PySpark unlocks efficient, parallel data integration – allowing you to leverage Spark for ETL/ELT, real-time analytics, machine learning, and more on data stored in Couchbase.

In this post, we’ll cover how to get started with the PySpark connector, demonstrate basic read/write operations (both key-value and query-based) against both the Couchbase operational database and Capella Columnar, and share performance tuning tips to get the best throughput. Whether you’ve been using the Couchbase Spark Connector in Scala or you’re new to Couchbase-Spark integration, this guide will help you quickly ramp up using PySpark for your data engineering needs.

Why PySpark?

Adding PySpark support to the existing Couchbase Spark Connector was driven by growing demand from data engineers and developers who prefer Python for its simplicity and its massive ML ecosystem in data science and engineering workflows. This support ensures that teams already using Python can integrate Couchbase (whether you are using Couchbase Capella (DBaaS), a self-managed operational database, or a Capella Columnar database) into Python-based Spark workflows, enabling broader adoption and streamlined data processes.

Python’s dominance in AI/ML use cases, supported by frameworks such as SparkML, PyTorch, TensorFlow, H2O, DataRobot, scikit-learn, and SageMaker, as well as popular exploratory data analysis tools like Matplotlib and Plotly, further underscores the necessity for PySpark integration. Additionally, PySpark compatibility unlocks accelerated ETL and ML pipelines leveraging GPU acceleration (Spark RAPIDS) and facilitates sophisticated feature engineering and data wrangling tasks using widely adopted libraries such as Pandas, NumPy, and Spark’s built-in feature engineering APIs. This new support significantly streamlines data processes and expands adoption opportunities for Couchbase in data science and engineering teams.

Getting started with Couchbase PySpark

Getting started is straightforward. The Couchbase Spark Connector is distributed as a single JAR (Java archive) that you add to your Spark environment. You can obtain the connector from the official Couchbase download site or via Maven coordinates. Once you have the JAR, using it in PySpark is as simple as configuring your Spark session with the connector and Couchbase connection settings.

1. Get or create a Couchbase operational database or Capella Columnar database

The fastest way to start with Couchbase is to use our Capella DBaaS. Once there, you can either find your existing database or create an operational or columnar (for analytics) database. Alternatively, you can use our self-managed Couchbase Server.

2. Install PySpark (if not already)

If you are working in a Python environment, install PySpark using pip. For example, in a virtual environment:
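```bash
# Create and activate a virtual environment, then install PySpark
python -m venv venv
source venv/bin/activate
pip install pyspark
```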

This will install Apache Spark for use with Python. If you’re running on an existing Spark cluster or Databricks, PySpark may already be available.

3. Include the Couchbase Spark Connector JAR

Download the spark-connector-assembly-<version>.jar for the latest connector release. Then, when creating your Spark session or submitting your job, provide this JAR in the configuration. You can do this by setting the --jars option in spark-submit or via the SparkSession builder in code (as shown below).
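For example, a spark-submit invocation might look like the following sketch (the JAR path and script name are placeholders):

```bash
# Pass the connector JAR to Spark at submit time (paths are placeholders)
spark-submit \
  --jars /path/to/spark-connector-assembly-<version>.jar \
  my_pyspark_job.py
```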

4. Configure the Couchbase connection

You need to specify the Couchbase cluster connection string and credentials (username and password). In Capella, you can find the connection string on the “Connect” tab for operational databases, or under Settings -> Connection String for Columnar. Optionally, specify a default bucket or scope if needed (though you can also specify the bucket/scope per operation).

Below is a quick PySpark example that sets up a SparkSession to connect to a Couchbase cluster and then reads some data:
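This is a minimal sketch: the JAR path, connection string, credentials, and the bucket_name/scope_name/collection_name identifiers are placeholders you would replace with your own values.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("CouchbasePySparkExample")
    # Path to the connector JAR (placeholder)
    .config("spark.jars", "/path/to/spark-connector-assembly-<version>.jar")
    # Couchbase connection settings (placeholders)
    .config("spark.couchbase.connectionString", "couchbases://cb.<your-endpoint>.cloud.couchbase.com")
    .config("spark.couchbase.username", "your_username")
    .config("spark.couchbase.password", "your_password")
    .getOrCreate()
)

# Read a collection as a DataFrame via the Query service
df = (
    spark.read.format("couchbase.query")
    .option("bucket", "bucket_name")
    .option("scope", "scope_name")
    .option("collection", "collection_name")
    .load()
)

df.show(5)
```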

In the code above, we configure the Spark session to include the Couchbase connector JAR and point it to a Couchbase cluster. We then create a DataFrame df by reading from the bucket_name bucket (specifically the scope_name.collection_name collection) via the Query service.

For the rest of this post, we assume you have loaded our sample dataset travel-sample, which can be done very easily for Couchbase Capella operational or Columnar databases.

Read/write to Couchbase using PySpark

Once your Spark session is connected to Couchbase, you can perform both key-value operations (for writes) and query operations (using SQL++ for both reads and writes) through DataFrames.

The following table shows the formats the Spark connector supports for reading from and writing to Couchbase operational and Columnar databases:

|                                              | Couchbase/Capella operational database                          | Capella Columnar database          |
|----------------------------------------------|-----------------------------------------------------------------|------------------------------------|
| Read operations                              | read.format("couchbase.query")                                  | read.format("couchbase.columnar")  |
| Write operations (Data service recommended)  | write.format("couchbase.kv"), write.format("couchbase.query")   | write.format("couchbase.columnar") |

Reading from Couchbase with a Query DataFrame

The Couchbase Spark Connector allows you to load data from a Couchbase bucket as a Spark DataFrame via SQL++ queries. Using the DataFrame reader with format couchbase.query, you can specify a bucket (and scope/collection) and optional query parameters. For example, to read all documents from a collection or a subset defined by a filter:
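The snippet below is a sketch assuming the travel-sample dataset is loaded; the bucket, scope, and collection option keys follow the connector documentation:

```python
# Load the travel-sample.inventory.airline collection via the Query service
airlines_df = (
    spark.read.format("couchbase.query")
    .option("bucket", "travel-sample")
    .option("scope", "inventory")
    .option("collection", "airline")
    .load()
)

# Filter for airlines based in the United States; the connector attempts to
# push this predicate down to Couchbase as a SQL++ WHERE clause
usa_airlines_df = airlines_df.filter(airlines_df.country == "United States")
usa_airlines_df.show(5)
```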

In this example, airlines_df loads all documents from the travel-sample.inventory.airline collection into a Spark DataFrame. We then apply a filter to find airlines based in the United States. The connector will attempt to push down filters to Couchbase so that unnecessary data isn’t transferred (i.e. it will include the WHERE country = 'United States' clause in the SQL++ query it runs, if possible). The result, usa_airlines_df, can be used like any other DataFrame in Spark (for example, you could join it with other DataFrames, apply aggregations, etc.).

Under the hood, the connector partitions the query results into multiple tasks if configured (more on this in Performance Tuning below), and uses Couchbase’s Query service (powered by the SQL++ engine) to retrieve the data. Each Spark partition corresponds to a subset of data retrieved by an equivalent SQL++ query. This allows parallel reads from Couchbase, leveraging the distributed nature of both Spark and Couchbase.

Writing to Couchbase with Key-Value (KV) operations (recommended)

The connector also supports writing data to Couchbase, either via the Data service (KV) or via the Query service (executing SQL++ INSERT/UPSERT commands for you). The recommended way for most use cases is to use the Key-Value data source (format("couchbase.kv")) for better performance. In key-value mode, each Spark task will write documents directly to Couchbase data nodes.

When writing a DataFrame to Couchbase, you must ensure there is a unique ID for each document (since Couchbase requires a document ID). By default, the connector looks for a column named __META_ID (or META_ID in newer versions) in the DataFrame for the document ID. You can also specify a custom ID field via the IdFieldName option.

For example, suppose we have a Spark DataFrame new_airlines_df that we want to write to Couchbase. It has a column airline_id that should serve as the Couchbase document key, and the rest of the columns are the document content:
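A minimal sketch of such a write follows; the bucket/scope/collection values are examples, and the "idFieldName" option key is an assumption based on the IdFieldName setting mentioned above:

```python
# Write new_airlines_df to Couchbase via the Data (KV) service,
# using the airline_id column as the document key
# NOTE: "idFieldName" is assumed to be the string key for the IdFieldName option
(
    new_airlines_df.write.format("couchbase.kv")
    .option("bucket", "travel-sample")
    .option("scope", "inventory")
    .option("collection", "airline")
    .option("idFieldName", "airline_id")
    .save()
)
```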

Writing to Couchbase with Query (SQL++) operations

While we recommend using the Data service (KV) as above, since it is typically faster than the Query service, you can also write via the Query service by using format("couchbase.query") on write. This will internally execute SQL++ UPSERT statements for each row. This may be useful if you need to leverage a SQL++ feature (for example, server-side transformations), but for straightforward inserts/updates, the KV approach is more efficient.
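The call looks nearly identical to the KV sketch above; only the format changes (the option values here are again examples, and "idFieldName" is an assumed key):

```python
# Write via the Query service (SQL++ UPSERTs) instead of KV
(
    new_airlines_df.write.format("couchbase.query")
    .option("bucket", "travel-sample")
    .option("scope", "inventory")
    .option("collection", "airline")
    .option("idFieldName", "airline_id")
    .save()
)
```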

In the next section, let us modify these basic read/write cases for Couchbase’s latest analytics product – Capella Columnar.

PySpark support for Capella ​​Columnar

One of the key new features in the Couchbase Spark Connector GA is Capella Columnar support. Capella Columnar is a JSON-native analytical database service in Couchbase Capella that stores data in a column-oriented format for high-performance analytics.

Reading Columnar-Formatted Data with PySpark

Reading data from a Couchbase Capella Columnar cluster in PySpark is similar to reading from a Couchbase operational cluster, with three changes:

  1. Use format("couchbase.columnar") to specify that the connection is for the Columnar service.
  2. The connection string for Columnar can be retrieved from the Capella UI.
  3. Specify which dataset to load by providing the database, scope, and collection names (analogous to bucket/scope/collection in operational Couchbase) as options.

Once Spark is configured, you can use the Spark DataFrame reader API to load data from the columnar service:
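Here is a sketch, assuming the Spark session was configured with your Columnar connection string and credentials and that the travel-sample dataset is loaded (the database/scope/collection option keys follow the connector documentation):

```python
# Load the airline collection from Capella Columnar as a DataFrame
airlines_df = (
    spark.read.format("couchbase.columnar")
    .option("database", "travel-sample")
    .option("scope", "inventory")
    .option("collection", "airline")
    .load()
)

airlines_df.show(5)
print(airlines_df.count())
```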

In this example, the resulting airlines_df is a normal Spark DataFrame — you can inspect it, run transformations, and perform actions like .count() or .show() as usual. For instance, airlines_df.show(5) will print a few airline documents, and airlines_df.count() will return the number of documents in the collection. Under the hood, the connector automatically infers a schema for the JSON documents by sampling up to a certain number of records (by default 1000). All fields that consistently appear in the sampled documents become columns in the DataFrame, with appropriate Spark data types.

Note that if your documents have varying schemas, the inference might produce a schema that includes the union of all fields (fields not present in some documents will be null in those rows). In cases where the schema is evolving or you want to restrict which records are considered, you can provide an explicit filter (predicate) to the reader, as described next.

Querying a Columnar Dataset in Couchbase via Spark

Often you may not want to load an entire collection, especially if it’s large. You can optimize performance by pushing down filter predicates directly to the Capella Columnar service when loading data, avoiding unnecessary data transfer. Use .option("filter", "<predicate>") to apply a SQL++ WHERE clause during the read operation. For instance, to load only airlines based in the United States:
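A sketch using the same example dataset as above:

```python
# Push the WHERE clause down to the Columnar service at read time
usa_airlines_df = (
    spark.read.format("couchbase.columnar")
    .option("database", "travel-sample")
    .option("scope", "inventory")
    .option("collection", "airline")
    .option("filter", "country = 'United States'")
    .load()
)

usa_airlines_df.show(5)
```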

The connector executes this filter directly at the source, retrieving only relevant documents. You can also push down projections (selecting specific fields) and aggregations in some cases – the connector will offload simple aggregates like COUNT, MIN, MAX, and SUM to the Columnar engine whenever possible, rather than computing them in Spark, for better performance.

Once data is loaded into a DataFrame, you can perform standard Spark transformations, joins, and aggregations. For example, to count airlines per country using Spark SQL, you can even create a temporary view to run Spark SQL queries on the data as follows:
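For instance, continuing with the airlines_df DataFrame from above:

```python
# Register the DataFrame as a temporary view and aggregate with Spark SQL
airlines_df.createOrReplaceTempView("airlines")

airlines_per_country = spark.sql("""
    SELECT country, COUNT(*) AS airline_count
    FROM airlines
    GROUP BY country
    ORDER BY airline_count DESC
""")

airlines_per_country.show()
```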

This query runs entirely within the Spark engine, giving you the flexibility to integrate Couchbase data seamlessly into complex analytical workflows.

Having covered basic reads and writes, let’s move on to how you can tune performance when moving large volumes of data between Couchbase and Spark.

Performance tuning tips

To maximize throughput and efficiency when using the Couchbase PySpark Connector, consider the following best practices.

Tuning your read operations

Use Query Partitioning for parallelism
(Couchbase Capella (DBaaS), self-managed operational database, or Capella Columnar)

When reading via the Query service for operational or columnar database, take advantage of the connector’s ability to partition the query results. You can specify a partitionCount (and a numeric partitioning field with lower/upper bounds) for the DataFrame read. A good rule of thumb is to set partitionCount to at least the total number of query service CPU cores available in your Couchbase cluster. This ensures Spark will run multiple queries in parallel, leveraging all query nodes. For example, if your Couchbase cluster’s Query service has 8 cores in total, set partitionCount >= 8 so that at least 8 parallel SQL++ queries will be issued. This can dramatically increase read throughput by utilizing all query nodes concurrently. Note that you must have enough cores in your Spark cluster as well to run that many parallel queries.
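The sketch below illustrates the idea. The partition-related option keys are assumptions modeled on the connector's partitioning settings, so verify the exact names in the connector documentation; the numeric column and bounds are examples:

```python
# Read with partitioned queries so Spark issues multiple SQL++ queries in parallel.
# NOTE: the partition* option keys below are assumptions - check the connector
# documentation for your version.
partitioned_df = (
    spark.read.format("couchbase.query")
    .option("bucket", "travel-sample")
    .option("scope", "inventory")
    .option("collection", "airline")
    .option("partitionColumn", "id")        # numeric field to split on (example)
    .option("partitionLowerBound", "0")     # example bounds for the id field
    .option("partitionUpperBound", "20000")
    .option("partitionCount", "8")          # >= total Query service CPU cores
    .load()
)
```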

Leverage covering indexes for query efficiency
(Couchbase Capella (DBaaS), self-managed operational database)

If using SQL++ queries, try to query through covering indexes whenever possible. A covering index is an index that includes all fields your query needs, so the query can be served entirely from the index without fetching from the data service. Covered queries avoid the extra network hop to fetch full documents, thus delivering better performance. Design your Couchbase secondary indexes to include the fields you filter on and the fields you return, if feasible. This might mean creating specific indexes for your Spark jobs that cover exactly the data needed.

Ensure index replicas to avoid bottlenecks
(Couchbase Capella (DBaaS), self-managed operational database)

Along with using covering indexes, make sure your indexes are replicated across multiple index nodes. Index replication not only provides high availability, but also allows queries to be load-balanced across index copies on different nodes for higher throughput. In practice, if you have (for example) 3 index nodes, replicating important indexes across them means the Spark connector’s parallel queries can hit different index nodes rather than all pounding a single node.

Tuning your write operations

Prefer the Data service for bulk writes
(Couchbase Capella (DBaaS), self-managed operational database)

We recommend using the key-value data source (Data service) rather than the Query service for write operations. Writing through the Data service (direct KV upserts) is typically several times faster than doing SQL++-based inserts. In fact, internal benchmarks have shown writing via KV can be around 3x faster than using SQL++ in Spark jobs. This is because the Data service can ingest documents in parallel directly to the nodes responsible, with lower latency per operation. Note that indexes are updated separately, if needed, for those new documents, as KV writes won’t automatically trigger index updates beyond the primary index.

Increase write partitions for Query service writes
(Couchbase Capella (DBaaS), self-managed operational database)

While not recommended, if you decide to use couchbase.query for writing (for example, if performing server-side transformations while writing), optimize performance by using a high number of write partitions. You can repartition your DataFrame before writing so that Spark runs many concurrent write tasks. A rough guideline is to use on the order of hundreds of partitions for large-scale writes via SQL++. For instance, using about 128 partitions per Query node CPU is a starting point some users have found effective. This means if you have 8 query cores, try ~1024 partitions. The idea is to flood the Query service with enough parallel UPSERT statements to maximize throughput. Be cautious and find the right balance for your cluster – too much concurrency could overload the Query service. Monitor Couchbase’s query throughput and adjust accordingly.
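A sketch of the repartition-then-write pattern (the partition count and option values are examples to tune for your cluster, and "idFieldName" is an assumed key as noted earlier):

```python
# Repartition so Spark issues many concurrent SQL++ UPSERTs (tune the count)
(
    new_airlines_df.repartition(1024)
    .write.format("couchbase.query")
    .option("bucket", "travel-sample")
    .option("scope", "inventory")
    .option("collection", "airline")
    .option("idFieldName", "airline_id")
    .save()
)
```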

By following these tuning tips – aligning partition counts with cluster resources, indexing smartly, and choosing the right service for the job – you can achieve optimal performance for Couchbase-Spark integration. Keep an eye on both Spark’s job metrics and Couchbase’s performance stats (available in the Couchbase UI and logs) to identify any bottlenecks (e.g., if one query node is doing all the work, or if the network is saturated) and adjust the configuration as needed.

Community and support

Couchbase PySpark support is built on the open-source Couchbase Spark Connector, and we encourage you to contribute, provide feedback, and join the conversation. You can access our comprehensive documentation, join the Couchbase Forums, or hop into the Couchbase Discord.

Further reading

For more information, please refer to the official Couchbase Spark Connector documentation and its section about PySpark.

Happy coding!

The Couchbase Team



Author

Posted by Vishal Dhiman, Sr. Product Manager
