As enterprises look to deploy production-ready AI agent applications, Large Language Model (LLM) observability has emerged as a critical requirement for ensuring both performance and trust. Organizations need visibility into how agents interact with data, make decisions, and retrieve information to maintain reliability, security, and compliance. Without proper observability, enterprises risk deploying models that produce inconsistent, inaccurate, or biased results, leading to poor user experiences and operational inefficiencies. The new partnership between Couchbase and Arize AI plays a vital role in bringing robust monitoring, evaluation, and optimization capabilities to AI-driven applications.

The integration of Couchbase and Arize AI delivers a powerful solution for building and monitoring Retrieval Augmented Generation (RAG) and agent applications at scale. By pairing Couchbase’s high-performance vector database with Arize AI’s observability and monitoring platform, enterprises can confidently build, deploy, and optimize Agentic RAG solutions in production.

In this blog, we’ll walk through creating an Agentic RAG QA chatbot using LangGraph and Couchbase Agent Catalog, a component of the recently announced Capella AI services (in preview), and then evaluating and optimizing its performance with Arize AI. This is a tangible example of how Couchbase and Arize AI enable developers to enhance retrieval workflows, improve response accuracy, and monitor LLM-powered interactions in real time.

The Value of the Couchbase and Arize AI Partnership

By joining forces, Couchbase and Arize AI are revolutionizing how developers build and evaluate AI agent applications. Developers can construct sophisticated agent applications by leveraging Couchbase Capella as a single data platform for LLM caching, long-term and short-term agent memory, vector embedding use cases, analytics, and operational workloads along with their favorite agent development framework for orchestrating agent workflows.

Couchbase Agent Catalog further enhances this system by providing a centralized store for multi-agent workflows within an organization that allows for storage, management, and discovery of various agent tools, prompt versioning, and LLM trace debugging.

To ensure high reliability and transparency, Arize AI provides critical observability features, including:

    • Tracing Agent Function Calls: Arize enables detailed monitoring of the agent’s function calls, including retrieval steps and LLM interactions, to track how responses are generated.
    • Dataset Benchmarking: Developers can create a structured dataset to evaluate and compare agent performance over time.
    • Performance Evaluation with LLM as a Judge: Using built-in evaluators, Arize leverages LLMs to assess response accuracy, relevance, and overall agent effectiveness.
    • Experimenting with Retrieval Strategies: By adjusting chunk sizes, overlaps, and the number of retrieved documents (K-value), developers can analyze their impact on agent performance.
    • Comparative Analysis in Arize: The platform allows side-by-side comparisons of different retrieval strategies, helping teams determine the optimal configuration for their agent.

The Importance of LLM Observability

To ensure that AI applications perform well in production, enterprises need a robust evaluation framework. Observability tools like Arize AI allow developers to:

    • Assess LLM outputs based on factors such as relevance, hallucination rates, and latency
    • Conduct systematic evaluations to measure the impact of prompt changes, retrieval modifications, and parameter adjustments
    • Curate comprehensive datasets to benchmark performance across different use cases
    • Automate evaluation processes within CI/CD pipelines, ensuring consistent application reliability

Using an LLM as a judge, Arize AI allows developers to measure agent effectiveness using pre-tested evaluators, multi-level custom evaluation techniques, and large-scale performance benchmarking. By running thousands of evaluations, teams can iterate quickly and refine LLM prompts, retrieval methods, and agent workflows to improve overall application quality.

Building an Agentic RAG QA Chatbot

Agentic RAG combines the power of traditional retrieval-augmented generation with intelligent decision-making. In this implementation, we enable an LLM to dynamically decide whether retrieval is necessary based on the query context.

Arize AI for Agentic RAG with Couchbase: illustration of the agent workflow, adapted from LangGraph’s agentic RAG example.

Step-by-Step Implementation

The rest of this blog is based on the accompanying tutorial notebook. Before building and deploying an observable AI agent, you’ll need to configure your development environment.

Prerequisites:

    1. To follow along with this tutorial, you’ll need to sign up for Arize and get your Space ID, API key, and Developer key (you can see the guide here). You will also need an OpenAI API key.
    2. You’ll need to set up your Couchbase cluster by doing the following:
      1. Create an account at Couchbase Cloud
      2. Create a free cluster with the Data, Index, and Search services enabled*
      3. Create cluster access credentials
      4. Allow access to the cluster from your local machine
      5. Create a bucket to store your documents
      6. Create a search index
    3. Create the tools and prompts required by the agents using Couchbase Agent Catalog (for installation and more instructions, see the documentation here)

*The Search Service will be used to perform semantic search later when we use Agent Catalog.


1) Create an Agentic RAG chatbot using LangGraph, with Couchbase as the vector store and Agent Catalog to manage agent tools and prompts

Setting Up Dependencies
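
The notebook starts by installing and importing the required libraries. The list below is a representative sketch; the authoritative requirements (and versions) are in the accompanying tutorial notebook.

```python
# Representative dependencies (install commands shown as comments):
#   pip install langgraph langchain langchain-openai langchain-couchbase couchbase
#   pip install agentc                        # Couchbase Agent Catalog SDK (see its docs)
#   pip install arize-otel openinference-instrumentation-langchain
#   pip install "arize[Datasets]" arize-phoenix-evals pandas

import os
import getpass

# Collect the keys gathered in the prerequisites if they are not already set.
for var in ("OPENAI_API_KEY", "ARIZE_SPACE_ID", "ARIZE_API_KEY", "ARIZE_DEVELOPER_KEY"):
    if not os.environ.get(var):
        os.environ[var] = getpass.getpass(f"Enter {var}: ")
```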

Connecting to Couchbase

We’ll use Couchbase as our vector store. Here’s how to set up the connection:
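
Below is a minimal connection sketch using the Couchbase Python SDK and the langchain-couchbase integration. The connection string, credentials, bucket, scope, collection, and index names are placeholders for the resources created in the prerequisites, and the vector store class name may vary by langchain-couchbase version (newer releases call it CouchbaseSearchVectorStore).

```python
import os
from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
from langchain_couchbase.vectorstores import CouchbaseVectorStore
from langchain_openai import OpenAIEmbeddings

# Connect to the Capella cluster created in the prerequisites.
auth = PasswordAuthenticator(os.environ["CB_USERNAME"], os.environ["CB_PASSWORD"])
cluster = Cluster(os.environ["CB_CONN_STRING"], ClusterOptions(auth))
cluster.wait_until_ready(timedelta(seconds=5))

# Expose the bucket/scope/collection as a LangChain vector store backed by the
# search index created earlier (all names below are placeholders).
vector_store = CouchbaseVectorStore(
    cluster=cluster,
    bucket_name="docs-bucket",
    scope_name="_default",
    collection_name="_default",
    embedding=OpenAIEmbeddings(),
    index_name="vector-search-index",
)
```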

Document Ingestion

We’ll create a helper function to load and index documents with configurable chunking parameters:
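
A sketch of such a helper is shown below, assuming web pages as the document source; the notebook’s actual loader and default chunking values may differ.

```python
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter


def load_and_index(urls, vector_store, chunk_size=1024, chunk_overlap=20):
    """Load documents, split them with the given chunking parameters,
    and index the chunks in the Couchbase vector store."""
    docs = [doc for url in urls for doc in WebBaseLoader(url).load()]
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_documents(docs)
    vector_store.add_documents(chunks)
    return len(chunks)
```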

Setting Up the Retriever Tool

We fetch our retriever tool from the Agent Catalog using the agentc provider. As the application grows more complex and more tools (and/or prompts) are required, the Agent Catalog SDK and CLI can be used to automatically fetch tools by use case (via semantic search) or by name.

For instructions on how this tool was created and more capabilities of Agent Catalog, please refer to the documentation here.
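
As a rough illustration only: the snippet below assumes an agentc provider object that can look up a published tool by name and return a LangChain-compatible callable. The class and method names (Provider, get_item) and the tool name are assumptions here; follow the Agent Catalog documentation for the authoritative SDK surface.

```python
import agentc

# NOTE: the provider/lookup API shown here is an assumption for illustration;
# consult the Agent Catalog documentation for the exact calls.
provider = agentc.Provider()

# Fetch the retriever tool published to the catalog in the prerequisites
# (the tool name is a placeholder).
retriever_tool = provider.get_item(name="search_couchbase_docs", item_type="tool")
tools = [retriever_tool]
```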

Defining the Agent State

We will define a graph that coordinates the agents involved so they can communicate with each other. Agents communicate through a state object that is passed to each node and updated with that node’s output.

Our state will be a list of messages and each node in our graph will append to it:
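
This follows the standard LangGraph pattern, where add_messages appends each node’s messages to the shared state rather than overwriting it:

```python
from typing import Annotated, Sequence, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph.message import add_messages


class AgentState(TypedDict):
    # The shared state is the running message list; `add_messages` appends
    # each node's output instead of replacing the list.
    messages: Annotated[Sequence[BaseMessage], add_messages]
```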

Creating Agent Nodes

We’ll define the core components of our agent pipeline:

Nodes: Relevance Checking Function, Query Rewriter, Main Agent, Response Generation
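
A condensed sketch of these nodes is shown below, following LangGraph’s agentic RAG example. It assumes the `tools` list from the retriever-tool step above; the model name and prompts are illustrative rather than the notebook’s exact ones.

```python
from langchain_core.messages import HumanMessage
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model choice is illustrative


def agent(state):
    """Main agent: decide whether to answer directly or call the retriever tool."""
    response = llm.bind_tools(tools).invoke(state["messages"])
    return {"messages": [response]}


def grade_documents(state):
    """Relevance check: route to 'generate' if the retrieved context looks
    relevant to the question, otherwise to 'rewrite'."""
    question = state["messages"][0].content
    context = state["messages"][-1].content
    prompt = PromptTemplate.from_template(
        "Retrieved document:\n{context}\n\nUser question: {question}\n"
        "Answer 'yes' if the document is relevant to the question, otherwise 'no'."
    )
    verdict = (prompt | llm).invoke({"context": context, "question": question}).content
    return "generate" if "yes" in verdict.lower() else "rewrite"


def rewrite(state):
    """Query rewriter: reformulate the question to improve retrieval."""
    question = state["messages"][0].content
    msg = HumanMessage(content=f"Rephrase this question to improve retrieval: {question}")
    return {"messages": [llm.invoke([msg])]}


def generate(state):
    """Response generation: answer the question using the retrieved context."""
    question = state["messages"][0].content
    context = state["messages"][-1].content
    answer = llm.invoke(
        [HumanMessage(content=f"Context:\n{context}\n\nQuestion: {question}\nAnswer concisely.")]
    )
    return {"messages": [answer]}
```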

Building the Agent Graph

Now we’ll connect the nodes into a coherent workflow:
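
A sketch of the wiring, again mirroring LangGraph’s agentic RAG example: the agent either calls the retriever tool or finishes, and retrieved documents are graded before the graph generates an answer or rewrites the query.

```python
from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode, tools_condition

workflow = StateGraph(AgentState)

workflow.add_node("agent", agent)               # decides whether to retrieve
workflow.add_node("retrieve", ToolNode(tools))  # executes the retriever tool
workflow.add_node("rewrite", rewrite)           # reformulates the query
workflow.add_node("generate", generate)         # produces the final answer

workflow.add_edge(START, "agent")
# The agent either calls the retriever tool or ends the conversation.
workflow.add_conditional_edges("agent", tools_condition, {"tools": "retrieve", END: END})
# After retrieval, grade the documents and either generate or rewrite.
workflow.add_conditional_edges("retrieve", grade_documents)
workflow.add_edge("rewrite", "agent")
workflow.add_edge("generate", END)

graph = workflow.compile()
```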

Visualizing the Agent Graph

Let’s visualize our workflow to better understand it:
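
In a notebook, LangGraph can render the compiled graph as a Mermaid diagram:

```python
from IPython.display import Image, display

# Render the compiled graph as a Mermaid PNG (requires a notebook environment).
display(Image(graph.get_graph(xray=True).draw_mermaid_png()))
```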


2) Trace the agent’s function calls using Arize, capturing retrieval queries, LLM responses, and tool usage

Arize provides comprehensive observability for our agent system. Let’s set up tracing:
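
A minimal setup, assuming the arize-otel and openinference-instrumentation-langchain packages; the project name is a placeholder.

```python
import os

from arize.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

# Register an OpenTelemetry tracer that exports spans to Arize.
tracer_provider = register(
    space_id=os.environ["ARIZE_SPACE_ID"],
    api_key=os.environ["ARIZE_API_KEY"],
    project_name="couchbase-agentic-rag",  # placeholder project name
)

# Auto-instrument LangChain/LangGraph so every agent step, retrieval call,
# and LLM invocation is captured as a span.
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
```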

Now let’s run the agent to see how it works:
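
For example (the question below is a placeholder), streaming the graph prints each node’s output as it executes:

```python
import pprint

inputs = {
    "messages": [
        ("user", "What does Couchbase Agent Catalog provide for multi-agent workflows?")
    ]
}

# Stream the graph so each node's update is printed as it runs.
for update in graph.stream(inputs):
    for node_name, node_output in update.items():
        print(f"--- Output from node: {node_name} ---")
        pprint.pprint(node_output, indent=2, width=100)
```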

This will execute our agent graph and output detailed information for each node as it processes the query. In Arize, you’ll be able to see a trace visualization showing the execution flow, latency, and details of each function call.

Tracing visualization from the Arize platform


3) Benchmark performance by generating a dataset with queries and expected responses

To systematically evaluate our system, we need a benchmark dataset:
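
A sketch of dataset creation with the Arize datasets client is shown below. The example rows are placeholders, and the client constructor and constants follow Arize’s documentation at the time of writing, so verify them against your SDK version.

```python
import os

import pandas as pd
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

# A small benchmark of questions and reference answers (placeholder rows);
# in practice this should cover your application's key query patterns.
df = pd.DataFrame(
    {
        "question": [
            "What services does Couchbase Capella provide for AI agents?",
            "How does Agent Catalog help manage agent tools?",
        ],
        "expected_answer": [
            "LLM caching, agent memory, vector search, analytics, and operational workloads.",
            "It stores, versions, and lets agents discover tools and prompts centrally.",
        ],
    }
)

client = ArizeDatasetsClient(
    developer_key=os.environ["ARIZE_DEVELOPER_KEY"],
    api_key=os.environ["ARIZE_API_KEY"],
)

dataset_id = client.create_dataset(
    space_id=os.environ["ARIZE_SPACE_ID"],
    dataset_name="agentic-rag-benchmark",  # placeholder dataset name
    dataset_type=GENERATIVE,
    data=df,
)
```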


4) Evaluate Performance Using LLM as a Judge

We’ll use LLM-based evaluation to assess the quality of our agent’s responses:
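
One way to do this is with the Phoenix evals library and its pre-tested QA correctness template (the relevance template, RAG_RELEVANCY_PROMPT_TEMPLATE, works the same way). The sketch assumes a results_df DataFrame from the benchmark runs with columns named to match the template’s variables; the judge model is illustrative.

```python
from phoenix.evals import (
    OpenAIModel,
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    llm_classify,
)

judge = OpenAIModel(model="gpt-4o")  # judge model choice is illustrative

# `results_df` is assumed to contain input (question), output (agent answer),
# and reference (expected answer) columns from the benchmark runs.
qa_correctness = llm_classify(
    dataframe=results_df,
    template=QA_PROMPT_TEMPLATE,
    model=judge,
    rails=list(QA_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # keep the judge's reasoning for debugging
)
```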


5) Experiment with Retrieval Settings

Now let’s experiment with different retrieval configurations to optimize our system, varying the chunk size, chunk overlap, and number of retrieved documents (K):
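
For example, the configurations might look like this (values are illustrative):

```python
# Retrieval configurations to compare: chunk size, chunk overlap, and the
# number of retrieved documents (K).
configs = [
    {"chunk_size": 512, "chunk_overlap": 50, "k": 2},
    {"chunk_size": 1024, "chunk_overlap": 20, "k": 4},
    {"chunk_size": 2048, "chunk_overlap": 100, "k": 6},
]
```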

Next, we’ll run an experiment for each configuration and log the results to Arize:
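
A heavily simplified sketch: for each configuration we answer every benchmark question and log the run to Arize as an experiment (re-indexing with the new chunking parameters is elided for brevity). The run_experiment arguments and task signature follow Arize’s experiments documentation at the time of writing, so verify them against the current SDK.

```python
import os


def make_task(cfg):
    """Build a task that answers a benchmark question under one configuration.
    Re-indexing with cfg["chunk_size"] / cfg["chunk_overlap"] and retrieving
    cfg["k"] documents is omitted here for brevity."""
    def task(dataset_row):
        question = dataset_row["question"]
        result = graph.invoke({"messages": [("user", question)]})
        return result["messages"][-1].content
    return task


for cfg in configs:
    name = f"chunk{cfg['chunk_size']}_overlap{cfg['chunk_overlap']}_k{cfg['k']}"
    client.run_experiment(
        space_id=os.environ["ARIZE_SPACE_ID"],
        dataset_id=dataset_id,
        task=make_task(cfg),
        experiment_name=name,
    )
```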


6) Compare Experiments in Arize

After running all the experiments, you can now view and compare them in the Arize UI. The experiments should be visible in your Arize workspace under the dataset name we created earlier.

Experiments comparison view from Arize Platform

In Arize, you can:

    1. Compare the overall performance metrics between different configurations
    2. Analyze per-question performance to identify patterns
    3. Examine trace details to understand execution flow
    4. View relevance and correctness scores for each experiment
    5. See explanations for evaluation decisions

Innovate with Couchbase and Arize AI

The integration of Couchbase and Arize empowers enterprises to build robust, production-ready GenAI applications with strong observability and optimization capabilities. By leveraging Agentic RAG with monitored retrieval decisions, organizations can improve accuracy, reduce hallucinations, and ensure optimal performance over time.

As enterprises continue to push the boundaries of GenAI, combining high-performance vector storage with AI observability will be key to deploying reliable and scalable applications. With Couchbase and Arize, organizations have the tools to confidently navigate the challenges of enterprise GenAI deployment.

Additional Resources

Author

Posted by Richard Young - Dir. Partner Solutions Architecture, Arize AI
