In today’s data-driven world, the ability to efficiently gather and prepare data is crucial for the success of any application. Whether you’re developing a chatbot, a recommendation system, or any AI-driven solution, the quality and structure of your data can make or break your project. In this article, we’ll take you on a journey to explore the process of information gathering and smart chunking, focusing on how to prepare data for Retrieval-Augmented Generation (RAG) in any application with your database of choice.
Data Collection: The Foundation of RAG
The Magic of Scrapy
Imagine a spider, not the creepy kind, but a diligent librarian spider in the massive library of the internet. This spider, embodied by Scrapy’s Spider class, starts at the entrance (the starting URL) and methodically visits every room (webpage), collecting precious books (HTML pages). Whenever it finds a door to another room (a hyperlink), it opens it and continues its exploration, ensuring no room is left unchecked. This is how Scrapy works—systematically and meticulously gathering every piece of information.
Leveraging Scrapy for Data Collection
Scrapy is a Python-based framework designed to extract data from websites. It’s like giving our librarian spider superpowers. With Scrapy, we can build web spiders that navigate through web pages and extract desired information with precision. In our case, we deploy Scrapy to crawl the Couchbase documentation website and download HTML pages for further processing and analysis.
Setting Up Your Scrapy Project
Before our spider can start its journey, we need to set up a Scrapy project. Here’s how you do it:
- Install Scrapy: If you haven’t already installed Scrapy, you can do so using pip:
pip install scrapy
- Create a New Scrapy Project: Set up your new Scrapy project with the following command:
scrapy startproject couchbase_docs
Crafting the Spider
With the Scrapy project set up, we now create the spider that will crawl through the Couchbase documentation website and download HTML pages. Here’s how it looks:
from pathlib import Path

import scrapy


class CouchbaseSpider(scrapy.Spider):
    name = "couchbase"
    start_urls = ["https://docs.couchbase.com/home/index.html"]

    def parse(self, response):
        # Download HTML content of the current page
        page = response.url.split("/")[-1]
        filename = f"{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")

        # Extract links and follow them
        for href in response.css("ul a::attr(href)").getall():
            if href.endswith(".html") or "docs.couchbase.com" in href:
                yield response.follow(href, self.parse)
Running the Spider
To run the spider and initiate the data collection process, execute the following command within the Scrapy project directory:
scrapy crawl couchbase
This command will start the spider, which will begin crawling the specified URLs and saving the HTML content. The spider extracts links from each page and follows them recursively, ensuring comprehensive data collection.
By automating data collection with Scrapy, we ensure all relevant HTML content from the Couchbase documentation website is retrieved efficiently and systematically, laying a solid foundation for further processing and analysis.
Extracting Text Content: Transforming Raw Data
After collecting HTML pages from the Couchbase documentation website, the next crucial step is to extract the text content. This transforms raw data into a usable format for analysis and further processing. Additionally, we may have PDF files containing valuable data, which we’ll also extract. Here, we’ll discuss how to use Python scripts to parse HTML files and PDFs, extract text data, and store it for further processing.
Extracting Text from HTML Pages
To extract text content from HTML pages, we’ll use a Python script that parses the HTML files and retrieves text data enclosed within <p> tags. This approach captures the main body of text from each page, excluding any HTML markup or structural elements.
Python Function for Text Extraction
Below is a Python function that demonstrates how to extract text content from HTML pages and store it in text files:
from bs4 import BeautifulSoup


def get_data(html_content):
    soup = BeautifulSoup(html_content, "html.parser")

    # Use the <title> tag as a file-friendly identifier for the page,
    # dropping the " | Couchbase Docs" suffix when it is present
    title = soup.title.string if soup.title else "untitled"
    if " | Couchbase Docs" in title:
        title = title[:title.index(" | Couchbase Docs")]
    title = title.replace(" ", "_")

    # Concatenate the text of every <p> tag on the page
    data = ""
    for line in soup.find_all("p"):
        data += " " + line.text

    return title, data
How to Use?
To use the get_data() function, incorporate it into your Python script or application and provide the HTML content as a parameter. The function will return the extracted text content.
html_content = '<html><head><title>Sample Page</title></head><body><p>This is a sample paragraph.</p></body></html>'
title, text = get_data(html_content)
print(title)  # Output: Sample_Page
print(text)   # Output: This is a sample paragraph.
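To process everything the spider downloaded, you can loop over the saved HTML files and write each page's extracted text to its own file. The snippet below is a minimal sketch of that step, assuming the HTML files sit in the current directory and using a hypothetical extracted_text/ folder for the output:

from pathlib import Path

# Hypothetical locations: adjust to wherever the spider saved its HTML files
html_dir = Path(".")
output_dir = Path("extracted_text")
output_dir.mkdir(exist_ok=True)

for html_file in html_dir.glob("*.html"):
    html_content = html_file.read_text(encoding="utf-8", errors="ignore")
    title, text = get_data(html_content)
    # One text file per page, named after the page title
    (output_dir / f"{title}.txt").write_text(text, encoding="utf-8")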
Extracting Text Content from PDFs
For extracting text content from PDFs, we’ll use a Python script that reads a PDF file and retrieves its data. This process ensures that all relevant textual information is captured for analysis.
Python Function for Text Extraction
Below is a Python function that demonstrates how to extract text content from PDFs:
from PyPDF2 import PdfReader


def extract_text_from_pdf(pdf_file):
    # Read every page of the PDF and concatenate its text
    reader = PdfReader(pdf_file)
    text = ''
    for page in reader.pages:
        text += page.extract_text()
    return text
How to Use?
To use the extract_text_from_pdf() function, incorporate it into your Python script or application and provide the PDF file path as a parameter. The function will return the extracted text content.
pdf_path = 'sample.pdf'
text = extract_text_from_pdf(pdf_path)
With the text content extracted and saved, we’ve completed the process of getting data from the Couchbase documentation.
Chunking: Making Data Manageable
Imagine you have a lengthy novel and you want to create a summary. Instead of reading the entire book at once, you break it down into chapters, paragraphs, and sentences. This way, you can easily understand and process each part, making the task more manageable. Similarly, chunking in text processing helps in dividing large texts into smaller, meaningful units. By organizing text into manageable chunks, we can facilitate easier processing, retrieval, and analysis of information.
Semantic and Content Chunking for RAG
For Retrieval-Augmented Generation (RAG), chunking is particularly important. We implemented both semantic and content chunking methods to optimize the data for the RAG process, which involves retrieving relevant information and generating responses based on that information.
Recursive Character Text Splitter
LangChain's recursive character text splitter breaks a piece of text into smaller chunks by applying a hierarchy of separators recursively: "\n\n" (double newline), "\n" (newline), " " (space), and "" (empty string). It tries the coarsest separator first and only falls back to the finer ones when a chunk is still too large.
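As a quick, standalone illustration of that separator hierarchy (our full project code appears further below with different parameters), the splitter can be configured explicitly like this:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Illustrative parameters only; the project code below uses chunk_size=2000 and overlap=50
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # tried in order, from coarsest to finest
    chunk_size=200,
    chunk_overlap=20,
    length_function=len,
)

sample_text = "Couchbase is a distributed NoSQL database.\n\nIt supports key-value and query access patterns."
for chunk in splitter.split_text(sample_text):
    print(chunk)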
Semantic Chunking
Semantic chunking is a text processing technique that focuses on grouping words or phrases based on their semantic meaning or context. This approach enhances understanding by creating meaningful chunks that capture the underlying relationships in the text. It is particularly useful for tasks that require detailed analysis of text structure and content organization.
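Our own implementation (shown in the next section) combines the recursive splitter with hash-based deduplication, but if you want to experiment with embedding-driven semantic chunking, LangChain ships an experimental SemanticChunker that splits text wherever the embedding distance between adjacent sentences spikes. The following is a minimal sketch, assuming the langchain_experimental and langchain_community packages plus a sentence-transformers backend are installed:

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_experimental.text_splitter import SemanticChunker

# Assumption: bge-m3 is used here only to stay consistent with the embedding step later on;
# any sentence-level embedding model would work
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-m3")
semantic_splitter = SemanticChunker(embeddings, breakpoint_threshold_type="percentile")

sample_text = "Couchbase stores JSON documents. Vector search finds semantically similar content. Paris is the capital of France."
docs = semantic_splitter.create_documents([sample_text])
for doc in docs:
    print(doc.page_content)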
Implementation of Semantic and Content Chunking
For our project, we implemented both semantic and content chunking methods. Semantic chunking preserves the hierarchical structure of the text, ensuring that each chunk maintains its contextual integrity. Content chunking was applied to remove redundant chunks and optimize storage and processing efficiency.
Python Implementation
Here’s a Python implementation of semantic and content chunking:
import hashlib

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Global set to store unique chunk hash values across all files
global_unique_hashes = set()


def hash_text(text):
    # Generate a hash value for the text using SHA-256
    hash_object = hashlib.sha256(text.encode())
    return hash_object.hexdigest()


def chunk_text(text, title, chunk_size=2000, overlap=50, length_function=len, debug_mode=0):
    global global_unique_hashes

    chunks = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=length_function
    ).create_documents([text])

    if debug_mode:
        for idx, chunk in enumerate(chunks):
            print(f"Chunk {idx + 1}: {chunk}\n")
        print('\n')

    # Deduplication mechanism: keep a chunk only if its hash has not been seen before
    unique_chunks = []
    for chunk in chunks:
        chunk_hash = hash_text(chunk.page_content)
        if chunk_hash not in global_unique_hashes:
            unique_chunks.append(chunk)
            global_unique_hashes.add(chunk_hash)

    # Prepend the page title to each chunk to preserve context
    for sentence in unique_chunks:
        sentence.page_content = title + " " + sentence.page_content

    return unique_chunks
These optimized chunks are then embedded and stored in the Couchbase cluster for efficient retrieval, ensuring seamless integration with the RAG process.
By employing both semantic and content chunking techniques, we effectively structured and optimized textual data for the RAG process and storage in the Couchbase cluster. The next step is to embed the chunks we just generated.
Embedding Chunks: Mapping the Galaxy of Data
Imagine each chunk of text as a star in a vast galaxy. By embedding these chunks, we assign each star a precise location in this galaxy, based on its characteristics and relationships with other stars. This spatial mapping allows us to navigate the galaxy more effectively, finding connections and understanding the broader universe of information.
Embedding Text Chunks for RAG
Embedding text chunks is a crucial step in the Retrieval-Augmented Generation (RAG) process. It involves transforming text into numerical vectors that capture the semantic meaning and context of each chunk, making it easier for machine learning models to analyze and generate responses.
Utilizing the BAAI Model BGE-M3
To embed the chunks, we use the BAAI model BGE-M3. This model is capable of embedding text into a high-dimensional vector space, capturing the semantic meaning and context of each chunk.
Embedding Function
The embedding function takes the chunks generated from the previous step and embeds each chunk into a 1024-dimensional vector space using the BAAI model BGE-M3. This process enhances the representation of each chunk, facilitating more accurate and contextually rich analysis.
Python Script for Embedding
Here’s a Python script that demonstrates how to embed text chunks using the BAAI model BGE-M3:
import json
from json import JSONEncoder

import numpy as np
from FlagEmbedding import BGEM3FlagModel  # BGE-M3 is distributed via the FlagEmbedding package

embed_model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=True)


class NumpyEncoder(JSONEncoder):
    # Serialize numpy arrays as plain lists so they can be stored as JSON
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return JSONEncoder.default(self, obj)


def embed(chunks):
    embedded_chunks = []
    for sentence in chunks:
        # Encode the chunk text into a 1024-dimensional dense vector
        emb = embed_model.encode(
            str(sentence.page_content),
            batch_size=12,
            max_length=600
        )['dense_vecs']
        embedding = np.array(emb)

        # Round-trip through JSON so the vector is stored as a plain list alongside the text
        json_dump = json.dumps(embedding, cls=NumpyEncoder)
        embedded_chunk = {
            "data": str(sentence.page_content),
            "vector_data": json.loads(json_dump)
        }
        embedded_chunks.append(embedded_chunk)

    return embedded_chunks
How to Use?
To use the embed() function, incorporate it into your Python script or application and provide the chunks generated from the previous steps as input. The function will return a list of embedded chunks.
from langchain.schema import Document

# Assume chunks is a list of Document chunks generated previously,
# e.g. the output of chunk_text()
chunks = [
    Document(page_content="This is the first chunk of text."),
    Document(page_content="This is the second chunk of text.")
]

embedded_chunks = embed(chunks)
These optimized chunks, now embedded in a high-dimensional vector space, are ready for storage and retrieval, ensuring efficient utilization of resources and seamless integration with the RAG process. By embedding the text chunks, we transform raw text into a format that machine learning models can efficiently process and analyze, enabling more accurate and contextually aware responses in the RAG system.
Storing Embedded Chunks: Ensuring Efficient Retrieval
Once the text chunks have been embedded, the next step is to store these vectors in a database. These embedded chunks can be pushed into vector databases or traditional databases with vector search support, such as Couchbase, Elasticsearch, or Pinecone, to facilitate efficient retrieval for Retrieval-Augmented Generation (RAG) applications.
Vector Databases
Vector databases are designed specifically to handle and search through high-dimensional vectors efficiently. By storing embedded chunks in a vector database, we can leverage advanced search capabilities to quickly retrieve the most relevant information based on the context and semantic meaning of queries.
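As an illustration of what pushing the embedded chunks into Couchbase can look like, here is a minimal sketch using the Couchbase Python SDK. The connection string, credentials, and the bucket/scope/collection names are placeholders you would replace with your own, and a vector search index would then be built on the "vector_data" field:

import uuid

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

# Placeholder connection details -- replace with your own cluster settings
cluster = Cluster(
    "couchbase://localhost",
    ClusterOptions(PasswordAuthenticator("username", "password"))
)
collection = cluster.bucket("rag-data").scope("docs").collection("chunks")


def store_embedded_chunks(embedded_chunks):
    # Each document keeps the raw chunk text alongside its embedding vector
    for chunk in embedded_chunks:
        collection.upsert(str(uuid.uuid4()), chunk)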
Integrating with RAG Applications
With the data prepared and stored, it is now ready to be used in RAG applications. The embedded vectors enable these applications to retrieve contextually relevant information and generate more accurate and meaningful responses, enhancing the overall user experience.
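Conceptually, retrieval works by embedding the user's query with the same model and returning the stored chunks whose vectors lie closest to it. The following database-agnostic sketch scores the embedded_chunks list from earlier with plain cosine similarity, reusing the embed_model defined in the embedding step; in production, your database's vector index performs this search for you:

import numpy as np


def retrieve_top_k(query, embedded_chunks, k=3):
    # Embed the query with the same BGE-M3 model used for the chunks
    query_vec = np.array(embed_model.encode(query, max_length=600)['dense_vecs'])

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Score every stored chunk against the query and keep the k best matches
    scored = [
        (cosine_similarity(query_vec, np.array(chunk["vector_data"])), chunk["data"])
        for chunk in embedded_chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

The top-k chunk texts are then passed as context to the language model that generates the final answer.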
Conclusion
By following this guide, we have successfully prepared data for Retrieval-Augmented Generation. We covered data collection using Scrapy, text content extraction from HTML and PDFs, chunking techniques, and embedding text chunks using the BAAI model BGE-M3. These steps ensure that the data is organized, optimized, and ready for use in RAG applications.
For more such technical and engaging content, check out other vector search-related blogs on our website and stay tuned for the next part in this series.