
Building a Document Processor for RAG Chatbots (Part 3)

Khalid Rizvi · Where Legacy Meets GenAI

Part 3: Building a Document Processor for RAG Chatbots

Welcome to Part 3 in our 4-part series on building intelligent document systems using LangChain. In Part 1, we learned how to load and chunk documents. In Part 2, we converted those chunks into embeddings and stored them in a vector database.

Now, in this post, we’ll design a reusable DocumentProcessor class that puts it all together—and becomes the backbone of a Retrieval-Augmented Generation (RAG) chatbot.


Why a Document Processor Class?

Instead of repeating the same code every time we load or search a document, we wrap this logic in a class. The DocumentProcessor does four key things:

  1. Loads and splits documents
  2. Converts chunks into embeddings
  3. Stores embeddings in a vector database
  4. Retrieves relevant chunks for a query

This class allows us to cleanly separate the document pipeline from the chatbot logic.
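
Here is the shape of the class we'll build in this post (a skeleton; each method is filled in below):

class DocumentProcessor:
    def __init__(self): ...                               # chunking config, embeddings, empty store
    def process_document(self, file_path): ...            # load, split, embed, index
    def retrieve_relevant_context(self, query, k=3): ...  # similarity search
    def reset(self): ...                                  # drop the index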


Core Design: Why We Store chunk_size and vectorstore as State

Here’s the constructor:

from langchain_openai import OpenAIEmbeddings

class DocumentProcessor:
    def __init__(self):
        self.chunk_size = 1000       # characters per chunk
        self.chunk_overlap = 100     # characters shared between adjacent chunks
        self.embedding_model = OpenAIEmbeddings()
        self.vectorstore = None      # initialized only after the first document is processed

Why store chunk_size and chunk_overlap in the class?

These values affect how documents are split. Keeping them as instance variables allows:

  • Easy configuration (you can change them per instance)
  • Consistency across all documents processed by that object
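
For example, since they are plain instance attributes, you can tweak them before processing (the values here are illustrative):

processor = DocumentProcessor()
processor.chunk_size = 500      # smaller chunks, e.g. for short FAQ-style documents
processor.chunk_overlap = 50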

Why not initialize vectorstore in the constructor?

We delay (or “lazy initialize”) the vector store because:

  • We don’t have any documents yet
  • The FAISS.from_documents constructor needs at least one document to build the index
  • This lets us build the vector store only when the first document is processed

Implementation Overview

from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

def process_document(self, file_path):
    # Load with the helper from Part 1 (see the sketch below), then split into chunks
    docs = self.load_document(file_path)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=self.chunk_size,
        chunk_overlap=self.chunk_overlap
    )
    split_docs = text_splitter.split_documents(docs)

    # Lazy initialization: build the FAISS index on the first call,
    # then append to it on every call after that
    if self.vectorstore is None:
        self.vectorstore = FAISS.from_documents(split_docs, self.embedding_model)
    else:
        self.vectorstore.add_documents(split_docs)

This method handles everything:

  • Loads the document
  • Splits it into chunks
  • Generates embeddings
  • Adds them to the FAISS vector store
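
The load_document helper isn't shown here; a minimal sketch, assuming the PDF loader from Part 1 of this series:

from langchain_community.document_loaders import PyPDFLoader

def load_document(self, file_path):
    # Returns one Document per page, as covered in Part 1
    loader = PyPDFLoader(file_path)
    return loader.load()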

Retrieval Method

def retrieve_relevant_context(self, query, k=3):
    # Nothing indexed yet, so there is nothing to search
    if self.vectorstore is None:
        return []
    # Return the k chunks most similar to the query
    return self.vectorstore.similarity_search(query, k=k)

This method lets the chatbot fetch relevant document chunks when answering a question.
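
For instance (the query is hypothetical):

for doc in processor.retrieve_relevant_context("What is the main argument?", k=3):
    print(doc.page_content[:100])  # preview each matching chunk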


Reset Method

def reset(self):
    # Drop the index; the next process_document call rebuilds it
    self.vectorstore = None

This is useful when you want to clear everything and start fresh, for example when switching to a new document set.
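
A typical use (file name hypothetical):

processor.reset()                             # discard the old index
processor.process_document("new_corpus.pdf")  # the next call rebuilds the store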


Who Is the Client of This Class?

The chatbot system is the client. It uses DocumentProcessor to:

  1. Process and embed documents
  2. Retrieve relevant context for a query
  3. Inject that context into a prompt for a language model (like GPT-4)

Full Example Usage

from document_processor import DocumentProcessor
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Step 1: Initialize document processor
processor = DocumentProcessor()

# Step 2: Process a PDF
file_path = "your_file.pdf"
processor.process_document(file_path)

# Step 3: Ask a question
query = "What is the theme of this book?"
relevant_docs = processor.retrieve_relevant_context(query)

# Step 4: Build a prompt
context = "\n\n".join([doc.page_content for doc in relevant_docs])
prompt_template = ChatPromptTemplate.from_template(
    "Answer the following question based on the provided context.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)
# format_messages returns chat messages, avoiding a stray "Human:" prefix in the content
prompt = prompt_template.format_messages(context=context, question=query)

# Step 5: Get response
chat = ChatOpenAI()
response = chat.invoke(prompt)

print(f"Q: {query}")
print(f"A: {response.content}")

Recap

  • We created a DocumentProcessor class to handle loading, chunking, embedding, and retrieval
  • We stored chunk_size, overlap, and vectorstore as state for clean, reusable logic
  • We showed how the chatbot becomes a client of this processor, retrieving context to build prompts

Coming Next: Part 4 – Building the Complete CLI Chatbot

In the final post, we’ll build the full CLI chatbot experience:

  • Accept user input
  • Retrieve matching context
  • Generate and stream model responses
  • Maintain history and state across turns

Stay tuned for the grand finale.
