# Part 3: Building a Document Processor for RAG Chatbots
Welcome to Part 3 in our 4-part series on building intelligent document systems with LangChain. In Part 1, we learned how to load and chunk documents. In Part 2, we converted those chunks into embeddings and stored them in a vector database.

Now, in this post, we'll design a reusable `DocumentProcessor` class that puts it all together and becomes the backbone of a Retrieval-Augmented Generation (RAG) chatbot.
## Why a Document Processor Class?
Instead of rewriting the same code every time we load or search a document, we wrap this logic in a class. The `DocumentProcessor` does four key things:
- Loads and splits documents
- Converts chunks into embeddings
- Stores embeddings in a vector database
- Retrieves relevant chunks for a query
This class allows us to cleanly separate the document pipeline from the chatbot logic.
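Before diving in, here is the shape of the class we'll build over the rest of this post (the method bodies are filled in section by section below):

```python
class DocumentProcessor:
    def __init__(self): ...                               # chunking config, embedding model, vector store
    def load_document(self, file_path): ...               # read a file into LangChain Documents
    def process_document(self, file_path): ...            # load, split, embed, and index a document
    def retrieve_relevant_context(self, query, k=3): ...  # fetch the k most relevant chunks
    def reset(self): ...                                  # clear the vector store
```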
## Core Design: Why We Store `chunk_size` and `vectorstore` as State
Here's the constructor:

```python
# Imports used throughout the class (FAISS and the text splitter
# appear in process_document below)
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

class DocumentProcessor:
    def __init__(self):
        self.chunk_size = 1000      # characters per chunk
        self.chunk_overlap = 100    # characters shared between adjacent chunks
        self.embedding_model = OpenAIEmbeddings()
        self.vectorstore = None     # initialized only after the first document is processed
```
### Why store `chunk_size` and `chunk_overlap` in the class?
These values control how documents are split. Keeping them as instance variables gives us:

- Easy configuration (you can change them per instance; see the sketch after this list)
- Consistency across all documents processed by the same object
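One way to make that per-instance configuration explicit is to accept the values as constructor arguments with sensible defaults. This variant isn't part of the original design, just a small sketch of the idea (it reuses the `OpenAIEmbeddings` import from above):

```python
class DocumentProcessor:
    def __init__(self, chunk_size=1000, chunk_overlap=100):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.embedding_model = OpenAIEmbeddings()
        self.vectorstore = None

# Two processors with different chunking strategies:
coarse = DocumentProcessor()                # default 1000-character chunks
fine = DocumentProcessor(chunk_size=300,    # smaller chunks for more precise retrieval
                         chunk_overlap=50)
```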
### Why not initialize `vectorstore` in the constructor?
We delay (or “lazy initialize”) the vector store because:
- We don’t have any documents yet
- FAISS requires data to create the index
- This lets us build the vector store only when the first document is processed
## Implementation Overview
```python
def process_document(self, file_path):
    # 1. Load the raw document
    docs = self.load_document(file_path)

    # 2. Split it into overlapping chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=self.chunk_size,
        chunk_overlap=self.chunk_overlap,
    )
    split_docs = text_splitter.split_documents(docs)

    # 3. Embed the chunks and index them in FAISS
    if self.vectorstore is None:
        # First document: create the index
        self.vectorstore = FAISS.from_documents(split_docs, self.embedding_model)
    else:
        # Later documents: add to the existing index
        self.vectorstore.add_documents(split_docs)
```
This method handles the full pipeline:

- Loads the document (via the `load_document` helper, sketched just below)
- Splits it into chunks
- Generates embeddings
- Adds them to the FAISS vector store
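One thing `process_document` assumes is a `load_document` helper on the class. Part 1 covered document loading in detail; as a reminder, here is a minimal sketch, assuming PDF and plain-text inputs and LangChain's community loaders:

```python
from langchain_community.document_loaders import PyPDFLoader, TextLoader

def load_document(self, file_path):
    # Choose a loader by file extension (PDF and plain text shown here)
    if file_path.lower().endswith(".pdf"):
        loader = PyPDFLoader(file_path)
    else:
        loader = TextLoader(file_path)
    return loader.load()  # a list of LangChain Document objects
```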
## Retrieval Method
```python
def retrieve_relevant_context(self, query, k=3):
    # Nothing has been indexed yet, so there is nothing to retrieve
    if self.vectorstore is None:
        return []
    # Return the k chunks most similar to the query
    return self.vectorstore.similarity_search(query, k=k)
```
This method lets the chatbot fetch relevant document chunks when answering a question.
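If you also want to know how strong each match is, LangChain's FAISS vector store exposes `similarity_search_with_score`. A possible companion method (an optional addition, not part of the class above):

```python
def retrieve_with_scores(self, query, k=3):
    if self.vectorstore is None:
        return []
    # Returns (Document, score) pairs; FAISS scores are distances,
    # so lower means more similar
    return self.vectorstore.similarity_search_with_score(query, k=k)
```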
## Reset Method
```python
def reset(self):
    # Drop the index; the next process_document call rebuilds it from scratch
    self.vectorstore = None
```
This is useful when you want to clear everything and start fresh, say, when loading a new document set.
## Who Is the Client of This Class?
The chatbot system is the client. It uses `DocumentProcessor` to:
- Process and embed documents
- Retrieve relevant context for a query
- Inject that context into a prompt for a language model (like GPT-4)
## Full Example Usage
```python
from document_processor import DocumentProcessor
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Step 1: Initialize the document processor
processor = DocumentProcessor()

# Step 2: Process a PDF
file_path = "your_file.pdf"
processor.process_document(file_path)

# Step 3: Ask a question
query = "What is the theme of this book?"
relevant_docs = processor.retrieve_relevant_context(query)

# Step 4: Build a prompt from the retrieved chunks
context = "\n\n".join(doc.page_content for doc in relevant_docs)
prompt_template = ChatPromptTemplate.from_template(
    "Answer the following question based on the provided context.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)
prompt = prompt_template.format(context=context, question=query)

# Step 5: Get the response
chat = ChatOpenAI()
response = chat.invoke(prompt)

print(f"Q: {query}")
print(f"A: {response.content}")
```
## Recap
- We created a `DocumentProcessor` class to handle loading, chunking, embedding, and retrieval
- We stored `chunk_size`, `chunk_overlap`, and `vectorstore` as state for clean, reusable logic
- We showed how the chatbot becomes a client of this processor, retrieving context to build prompts
## Coming Next: Part 4 – Building the Complete CLI Chatbot
In the final post, we’ll build the full CLI chatbot experience:
- Accept user input
- Retrieve matching context
- Generate and stream model responses
- Maintain history and state across turns
Stay tuned for the grand finale.