Don’t you love reading other people’s commit messages? No? Well, I do, and as I was reading a very insightful commit message, I realized how much untapped content lives in various Git logs (assuming the devs you follow write useful messages, of course). So, wouldn’t it be great if you could ask a repo questions? Let’s see how this can be achieved by doing RAG with Couchbase Shell.

Couchbase Shell configuration

The initial step is to install and configure cbsh. I am going to use my Capella instance. To get the config, go to the Connect tab of your Capella cluster and select Couchbase Shell. This gives you the config that goes under [[cluster]]. To configure the model, take a look at what’s under [[llm]]. I have chosen OpenAI but there are others. You need to define one model for the embedding (that’s what turns text into a vector) and one for the chat. The chat model takes the question and some additional context to answer it. And of course you will need an API key.

You also need Git installed, then you should be all set.

Import Git commit log

The first step is to get all the commits of the repo in JSON. Being lazy and old, and by old I mean not used to asking an AI, I searched for this on Google, found a number of Gists that linked to other Gists, and finally settled on this one.

I downloaded it, sourced it, went into my local couchbase-shell git repo and called it.

But, for the benefit of the reader wondering if I made the right decision, let’s ask the configured model. Cbsh has an ask command allowing you to do this:

This command will output each commit in the repository as a JSON object with the commit hash, author name and email, commit date, and commit message. The --all flag ensures all branches are included. The --reverse flag lists the commits in reverse chronological order. Finally, the output is redirected to a commits.json file.

Please make sure you run this command in the root directory of the Git repository you want to get the commits from.

And as it turns out, it does not work out of the box (shocking, I know). It also did not have all the info I needed, like the body part of the message. Of course we could spend time tuning this, but it’s very specific, with lots of edge cases.
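For readers who would rather not depend on a Gist, a minimal Python sketch of the export could look like the one below. It is not the script used above. The field names (commit, author, email, date, subject, body) and the choice of control characters as separators are assumptions made for illustration. The idea is to let git emit raw fields and let a JSON library do the escaping, which sidesteps most quoting edge cases.

```python
import json
import subprocess

# ASCII unit separator between fields, record separator between commits.
# Both are control characters that should not show up in commit messages.
FIELD_SEP = "\x1f"
RECORD_SEP = "\x1e"

# Hypothetical field names, chosen for this sketch.
FIELDS = ["commit", "author", "email", "date", "subject", "body"]

# %xNN makes git print the byte with that hex code.
GIT_FORMAT = "%x1f".join(["%H", "%an", "%ae", "%aI", "%s", "%b"]) + "%x1e"


def parse_git_log(raw):
    """Turn the raw separator-delimited log output into a list of dicts."""
    commits = []
    for record in raw.split(RECORD_SEP):
        record = record.strip("\n")
        if not record:
            continue
        values = [value.strip() for value in record.split(FIELD_SEP)]
        commits.append(dict(zip(FIELDS, values)))
    return commits


def export_commits(repo_path="."):
    """Run git log on all branches of repo_path and return parsed commits."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--all", f"--pretty=format:{GIT_FORMAT}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_git_log(result.stdout)


def write_commits(path="commits.json", repo_path="."):
    """Write the commits as a JSON array of objects."""
    with open(path, "w", encoding="utf-8") as out:
        json.dump(export_commits(repo_path), out, indent=2)
```

Calling write_commits() from the root of a repository should produce a commits.json file containing one object per commit, message body included.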

In any case I now have a list of commits in JSON format:

So what can you do with a JSON array of JSON objects? You can import it through the Capella UI or you can import it with Couchbase Shell. I first create the scope and collection and select them with cb-env, then create the SQL++ index.

Since cbsh is based on Nushell, the resulting JSON file can be easily opened, turned into a dataframe, transformed into a Couchbase document and inserted like so:
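The reshaping step itself is simple enough to sketch outside of cbsh. The Python below is an illustration only, not the cbsh pipeline: it assumes each commit object has a commit field holding the hash, uses that hash as the document key, and puts the commit under a content field.

```python
import json


def to_documents(commits, key_field="commit"):
    """Pair each commit with a document key, keeping the commit as content.

    key_field is an assumption: it names the field holding the commit hash.
    """
    return [{"id": commit[key_field], "content": commit} for commit in commits]


def load_documents(path="commits.json"):
    """Read the exported JSON array and reshape it into id/content pairs."""
    with open(path, encoding="utf-8") as source:
        return to_documents(json.load(source))
```

Using the commit hash as the key means that importing the same log twice overwrites documents instead of duplicating them.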

Let’s get some documents just to see how it worked:

So this is content we could use for RAG. Time to enrich these docs.

Enrich document with an AI model

To enrich the doc you need to have a model configured. Here I am using OpenAI and the enrich-doc cbsh command:

The SELECT clause will return a JSON object with the content of the doc, and additional fields id and text. Text is the subject and body appended into one string. The object is wrapped in a content object and given to the vector enrich-doc command, with text as a parameter, as it is the field that will be transformed into a vector. There should now be a textVector field in each doc.
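To make the data flow concrete, here is a rough Python sketch of what the enrichment amounts to. It is not how cbsh implements it. The embed argument stands for any function that turns a string into a list of floats, and fake_embed is a toy stand-in so the example runs without an API key. A real embedding model would return a much longer and far more meaningful vector.

```python
def build_text(commit):
    """Append subject and body into the single string that gets embedded."""
    subject = commit.get("subject", "")
    body = commit.get("body", "")
    return f"{subject}\n\n{body}".strip()


def enrich(doc_id, commit, embed):
    """Return a copy of the commit with id, text and textVector added."""
    text = build_text(commit)
    enriched = dict(commit)
    enriched["id"] = doc_id
    enriched["text"] = text
    enriched["textVector"] = embed(text)
    return enriched


def fake_embed(text, dims=8):
    """Toy stand-in for an embedding model: buckets characters by code point."""
    vector = [0.0] * dims
    for char in text:
        vector[ord(char) % dims] += 1.0
    return vector
```

For example, enrich("abc", {"subject": "Add Gemini", "body": ""}, fake_embed) returns the original fields plus id, text and an 8-element textVector.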

Vector Search

In order to search through these vectors, we need to create a Vector Search index. It’s doable through the API or UI for something customizable. Here I am happy with default choices so I use cbsh instead:

The index created will use dot_product as its similarity algorithm, the vector dimensionality will be 1536, the name of the index is commit and the indexed field is textVector. The bucket, scope and collection are the ones selected through cb-env.
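If dot_product sounds abstract, the scoring it implies can be sketched in a few lines of Python. This is a brute-force illustration over an in-memory list, not what the search index does internally. The document shape (an id and a textVector field) is an assumption carried over from the enrichment step.

```python
def dot_product(a, b):
    """Sum of pairwise products; higher means more similar for this metric."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vector, docs, k=3, field="textVector"):
    """Score every doc against the query and keep the k best (score, id) pairs."""
    scored = [(dot_product(query_vector, doc[field]), doc["id"]) for doc in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The dimension check is also why the index needs to know the dimensionality up front: a query vector can only be compared with stored vectors of the same length.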

To test vector search, the search query has to be turned into a vector, then piped to the search:

It returns 3 rows by default. Let’s extend it to see the content of the document. I am adding reject -i textVector to remove the vector field, because no one needs a 1536-line field in their terminal output:

Ask your Git Repository

From here you have all the commits of a Git repository stored in Couchbase, enriched with an AI model, indexed and searchable. The last thing to do is call the model to run a query with RAG. It starts by turning a question into a vector, piping it to a vector search, getting the full documents from the returned IDs, selecting the content object without the vector field, turning each object into a JSON doc (this way we can send the content and its structured metadata), wrapping the jsonText in a table and finally piping it to the ask command:
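The same chain of steps can be sketched in Python to show what each stage hands to the next. Everything here is an assumption for illustration: embed, search, fetch and chat are hypothetical callables standing in for the embedding model, the vector search, the document lookup and the chat model, and the prompt wording is made up.

```python
import json


def build_context(docs, vector_field="textVector"):
    """Serialize each document to JSON, leaving out the embedding."""
    return [
        json.dumps(
            {key: value for key, value in doc.items() if key != vector_field},
            sort_keys=True,
        )
        for doc in docs
    ]


def build_prompt(question, context):
    """Combine the retrieved commits and the question into one prompt."""
    joined = "\n".join(context)
    return (
        "Answer the question using only the commits below.\n\n"
        f"Commits (one JSON document per line):\n{joined}\n\n"
        f"Question: {question}"
    )


def ask_repo(question, embed, search, fetch, chat, k=3):
    """Embed the question, retrieve the k nearest commits, then ask the model."""
    query_vector = embed(question)          # question -> vector
    doc_ids = search(query_vector, k)       # vector -> ids of nearest commits
    docs = [fetch(doc_id) for doc_id in doc_ids]
    return chat(build_prompt(question, build_context(docs)))
```

Sending each commit as a JSON document rather than as plain text keeps the hash and date next to the message, which is what lets the model cite them in its answer.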

Here I am asking the LLM when Gemini support was introduced, and we get a date and a commit hash. It’s then easy to verify using git show. There is a bit of repetition here, so you can declare a variable for your question and reuse it:

And now we all know why the client crate had to be rewritten. It may not answer your own questions, but now you know how to get answers from any repo!

Author

Posted by Laurent Doguin

Laurent is a nerdy metal head who lives in Paris. He mostly writes code in Java and structured text in AsciiDoc, and often talks about data, reactive programming and other buzzwordy stuff. He is also a former Developer Advocate for Clever Cloud and Nuxeo where he devoted his time and expertise to helping those communities grow bigger and stronger. He now runs Developer Relations at Couchbase.

2 Comments

  1. Very cool. It would be interesting to include the full changelog, to give the LLM more context.

    1. Laurent Doguin April 1, 2025 at 6:28 am

      Yeah I was thinking about Github PR as well. Plenty of potential!
