In part 1 we saw how to scrape Twitter, turn tweets into JSON documents, get an embedding representation of each tweet, store everything in Couchbase, and run a Vector Search. These are the first steps of a Retrieval Augmented Generation architecture that could summarize a Twitter thread. The next step is to use a Large Language Model. We can prompt it to summarize the thread, and we can enrich the context of the prompt thanks to Vector Search.

LangChain and Streamlit

So how do we make this all work together with an LLM? That’s where the LangChain project can help. Its goal is to enable developers to build LLM-based applications. We already have some samples available on GitHub that showcase our LangChain module, like this RAG demo that lets the user upload a PDF, vectorize it, store it in Couchbase, and use it in a chatbot. That one is in JavaScript, but there is also a Python version.

As it turns out, this is exactly what I want to do, except with a list of tweets instead of a PDF. So I forked it and started playing with it here. Nithish is using a couple of interesting libraries: LangChain of course, and Streamlit. Another cool thing to learn! Streamlit is like a PaaS meets low code meets data science service. It lets you deploy data-based apps very easily, with minimal code, in a very, very opinionated way.

Configuration

Let’s break the code down into smaller chunks, starting with the configuration. The following method makes sure the right environment variables are set, and stops the application deployment if they are not.
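It looks roughly like this (a minimal sketch; the exact error message may differ from the repo):

```python
import os

import streamlit as st


def check_environment_variable(variable_name):
    """Stop the Streamlit app if a required environment variable is missing."""
    if variable_name not in os.environ:
        st.error(f"{variable_name} environment variable is not set.")
        st.stop()
```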

The check_environment_variable method is called several times to make sure the needed configuration is set; if it is not, the app stops.

This means everything in there is needed: a connection to OpenAI and a connection to Couchbase. Let’s quickly talk about Couchbase. It’s a JSON, multi-model, distributed database with an integrated cache. You can use it for K/V, SQL, Full-Text Search, Time Series, and Analytics, and we added fantastic new features in 7.6: Recursive CTEs to do graph queries, or the one that interests us most today, Vector Search. The fastest way to try it is to go to cloud.couchbase.com; there is a 30-day trial, no credit card required.

From there you can follow the steps to get your new cluster set up. Set up a bucket, scope, collection, and index, create a user, make sure your cluster is reachable from outside, and you can move on to the next part: getting a connection to Couchbase from the app. It can be done with these two functions. You can see they are annotated with @st.cache_resource, which caches the object from Streamlit’s perspective and makes it available for other instances or reruns. Here’s the doc excerpt:

Decorator to cache functions that return global resources (e.g. database connections, ML models).

Cached objects are shared across all users, sessions, and reruns. They must be thread-safe because they can be accessed from multiple threads concurrently. If thread safety is an issue, consider using st.session_state to store resources per session instead.

So with this we have a connection to the Couchbase cluster and a connection to the LangChain Couchbase vector store wrapper.

connect_to_couchbase(connection_string, db_username, db_password) creates the Couchbase cluster connection. get_vector_store(_cluster, db_bucket, db_scope, db_collection, _embedding, index_name) creates the CouchbaseVectorStore wrapper. It holds a connection to the cluster, the bucket/scope/collection information used to store data, the index name to make sure we can query the vectors, and an embedding property.
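Here’s a hedged sketch of those two cached functions, assuming the langchain-couchbase package and the Couchbase Python SDK; the exact options in the repo may differ slightly:

```python
from datetime import timedelta

import streamlit as st
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
from langchain_couchbase.vectorstores import CouchbaseVectorStore


@st.cache_resource(show_spinner="Connecting to Couchbase")
def connect_to_couchbase(connection_string, db_username, db_password):
    """Open the cluster connection once and reuse it across reruns."""
    auth = PasswordAuthenticator(db_username, db_password)
    cluster = Cluster(connection_string, ClusterOptions(auth))
    cluster.wait_until_ready(timedelta(seconds=5))
    return cluster


@st.cache_resource(show_spinner="Creating the vector store")
def get_vector_store(_cluster, db_bucket, db_scope, db_collection, _embedding, index_name):
    """Wrap the target collection in LangChain's Couchbase vector store."""
    return CouchbaseVectorStore(
        cluster=_cluster,
        bucket_name=db_bucket,
        scope_name=db_scope,
        collection_name=db_collection,
        embedding=_embedding,
        index_name=index_name,
    )
```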

Here it refers to the OpenAIEmbeddings function. It automatically picks up the OPENAI_API_KEY and allows LangChain to use OpenAI’s API with that key. Every API call is handled transparently by LangChain, which also means that switching model providers should be fairly transparent when it comes to embedding management.
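For reference, instantiating it can be as simple as this (assuming the langchain-openai package):

```python
from langchain_openai import OpenAIEmbeddings

# Reads OPENAI_API_KEY from the environment automatically
embedding = OpenAIEmbeddings()
```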

Writing LangChain Documents to Couchbase

Now for where the magic happens: we get the tweets, parse them as JSON, create the embeddings, and write the JSON docs to the specified Couchbase collection. Thanks to Streamlit we can set up a file upload widget and execute an associated function:
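A sketch of what that function can look like, assuming each tweet in the uploaded JSON has a text field and the remaining fields become metadata (the field names are assumptions, not the exact repo code); the upload widget itself is wired up in the sidebar later on:

```python
import json

import streamlit as st
from langchain_core.documents import Document


def save_to_vector_store(uploaded_file, vector_store):
    """Turn each tweet into a LangChain Document and store it with its embedding."""
    if uploaded_file is not None:
        tweets = json.loads(uploaded_file.getvalue())
        docs = [
            Document(
                page_content=tweet["text"],
                metadata={k: v for k, v in tweet.items() if k != "text"},
            )
            for tweet in tweets
        ]
        # The vector store creates the embeddings and writes the docs to Couchbase
        vector_store.add_documents(docs)
        st.info(f"{len(docs)} tweets vectorized and stored in Couchbase")
```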

It looks somewhat similar to the code in part 1, except all the embedding creation is managed transparently by LangChain. The text field is vectorized, and the metadata is added to the Couchbase doc. It will look like this:
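Roughly this shape; text and embedding are the vector store’s default field names, the embedding array is trimmed for readability, and the metadata keys depend on what was scraped in part 1:

```json
{
  "text": "<the tweet text that was vectorized>",
  "embedding": [0.0123, -0.0456, ...],
  "metadata": {
    "user": "<tweet author>",
    "created_at": "<timestamp>"
  }
}
```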

From now on we have functions to manage the tweet upload, vectorize the tweets, and store them in Couchbase. Time to use Streamlit to build the actual app and manage the chat flow. Let’s split that function into several chunks.

Write a Streamlit Application

Starting with the main declaration and the protection of the app. You don’t want just anyone to use it and burn through your OpenAI credits. Thanks to Streamlit this can be done fairly easily. Here we set up password protection using the LOGIN_PASSWORD environment variable, and we also set up the global page config with the set_page_config method. This gives you a simple form to enter the password, and a simple page.
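A minimal sketch of that gate; the page title and messages are placeholders, and in the repo this sits under the main entry point rather than flat at module level:

```python
import os

import streamlit as st

# Global page config
st.set_page_config(
    page_title="Chat with your Twitter thread",
    page_icon="🤖",
)

# Simple password gate: stop the script until LOGIN_PASSWORD is entered correctly
if "auth" not in st.session_state:
    st.session_state.auth = False

if not st.session_state.auth:
    pwd = st.text_input("Enter password", type="password")
    if st.button("Submit"):
        if pwd == os.getenv("LOGIN_PASSWORD"):
            st.session_state.auth = True
            st.rerun()
        else:
            st.error("Incorrect password")
    st.stop()
```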

To go a bit further we can add the environment variable checks, OpenAI and Couchbase configuration, and a simple title to start the app flow.
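Something along these lines, reusing the helpers from above (the environment variable names and the title are assumptions):

```python
# Fail fast if the configuration is incomplete
for var in (
    "OPENAI_API_KEY",
    "DB_CONN_STR",
    "DB_USERNAME",
    "DB_PASSWORD",
    "DB_BUCKET",
    "DB_SCOPE",
    "DB_COLLECTION",
    "INDEX_NAME",
):
    check_environment_variable(var)

# OpenAI embeddings plus the cached Couchbase connections from earlier
embedding = OpenAIEmbeddings()
cluster = connect_to_couchbase(
    os.getenv("DB_CONN_STR"), os.getenv("DB_USERNAME"), os.getenv("DB_PASSWORD")
)
vector_store = get_vector_store(
    cluster,
    os.getenv("DB_BUCKET"),
    os.getenv("DB_SCOPE"),
    os.getenv("DB_COLLECTION"),
    embedding,
    os.getenv("INDEX_NAME"),
)

st.title("Chat with your Twitter thread")
```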

Streamlit has a nice Codespaces integration; I really encourage you to use it, as it makes development really easy. And our VS Code plugin can be installed, so you can browse Couchbase and execute queries.

Run SQL++ Vector Search query from Codespace

A basic Streamlit application opened in Codespace

Create LangChain Chains

After that comes the chain setup. That’s really where LangChain shines. This is where we set up the retriever; it’s going to be used by LangChain to query Couchbase for all the vectorized tweets. Then it’s time to build the RAG prompt. You can see the template takes a {context} and a {question} parameter. We create a chat prompt object from the template.
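A sketch of that retriever and prompt setup; the template wording is illustrative, not the exact text from the repo:

```python
from langchain_core.prompts import ChatPromptTemplate

# Retriever backed by the Couchbase vector store
retriever = vector_store.as_retriever()

# RAG prompt: retrieved tweets fill {context}, the user input fills {question}
template = """You are a helpful bot. Answer the question based only on the following context:

{context}

Question: {question}"""

prompt = ChatPromptTemplate.from_template(template)
```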

After that comes the LLM choice; here I chose GPT-4. And finally, the chain creation.

The chain is built from the chosen model, the context and question parameters, the prompt object, and a StrOutputParser. Its role is to parse the LLM response and send it back as a streamable/chunkable string. The RunnablePassthrough method called for the question parameter makes sure it’s passed to the prompt as is, but you can use other methods to change or sanitize the question. That’s it, a RAG architecture: giving some additional context to an LLM prompt to get a better answer.
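Put together, the chain can look like this (the exact GPT-4 model identifier is an assumption):

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0)

# Retrieved documents fill {context}; the user's question is passed through unchanged
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```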

We can also build a chain without the retriever to compare the results:
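A sketch of that context-free chain, reusing the same LLM:

```python
template_without_rag = """You are a helpful bot. Answer the question as truthfully as possible.

Question: {question}"""

prompt_without_rag = ChatPromptTemplate.from_template(template_without_rag)

chain_without_rag = (
    {"question": RunnablePassthrough()}
    | prompt_without_rag
    | llm
    | StrOutputParser()
)
```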

No need for context in the prompt template and chain parameter, and no need for a retriever.

Now that we have a couple of chains, we can use them through Streamlit. This code adds the first question and the sidebar, allowing for file upload:
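Roughly like this; the greeting text is a placeholder:

```python
# Seed the conversation with a first assistant message
if "messages" not in st.session_state:
    st.session_state.messages = [
        {
            "role": "assistant",
            "content": "Hi, I'm a chatbot that can chat with your tweets. How can I help you?",
        }
    ]

# Sidebar with the upload widget wired to the function from earlier
with st.sidebar:
    st.header("Upload your tweets")
    uploaded_file = st.file_uploader("Choose a JSON file of tweets", type="json")
    if uploaded_file is not None:
        save_to_vector_store(uploaded_file, vector_store)
```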

Then the instructions and input logic:
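A hedged sketch of that chat loop, streaming the RAG answer first and the context-free answer second:

```python
# Replay the conversation so far
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Chat input: run both chains on the user's question
if question := st.chat_input("Ask a question about the thread"):
    st.chat_message("user").markdown(question)
    st.session_state.messages.append({"role": "user", "content": question})

    # Answer with RAG context from Couchbase
    with st.chat_message("assistant"):
        rag_response = st.write_stream(chain.stream(question))
    st.session_state.messages.append({"role": "assistant", "content": rag_response})

    # Answer without any context, for comparison
    with st.chat_message("assistant", avatar="🤖"):
        pure_llm_response = st.write_stream(chain_without_rag.stream(question))
    st.session_state.messages.append(
        {"role": "assistant", "content": pure_llm_response}
    )
```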

 

With that you have everything needed to run the Streamlit app that allows the user to:

    • Upload a JSON file containing tweets
    • Transform each tweet into a LangChain Document
    • Store them in Couchbase along with their embedding representation
    • Manage two different prompts:
      • one with a LangChain retriever to add context
      • and one without

If you run the app you should see something like this:

The full Streamlit application example opened in Codespace

Conclusion

And when you ask “Are socks important to developers?”, you get these two very interesting answers:

Based on the context provided, it seems that socks are important for some developers, as mentioned by Josh Long and Simon Willison in their tweets. They express a desire for socks and seem to value them.

Socks are important for developers as they provide comfort and support while spending long hours sitting at a computer. Additionally, keeping feet warm can help improve focus and productivity.

Voilà, we have a bot that knows about a Twitter thread and can answer accordingly. And the fun thing is that it did not use just the text vector in the context; it also used all the stored metadata, like the username, because we also indexed all the LangChain document metadata when creating the index in part 1.

But is this really summarizing the X thread? Not really, because Vector Search enriches the context with the closest documents, not the full thread. So there is a bit of data engineering to do. Let’s talk about this in the next part!

Author

Posted by Laurent Doguin, Developer Advocate, Couchbase

Laurent is a Paris-based Developer Advocate focused on helping Java developers and the French community. He writes code in Java and blog posts in Markdown. Prior to joining Couchbase he was Nuxeo’s community liaison, where he devoted his time and expertise to helping the entire Nuxeo Community become more active and efficient.
