# Part 3: Building a Document Processor for RAG Chatbots
Welcome to Part 3 in our 4-part series on building intelligent document systems with LangChain. In Part 1, we learned how to load and chunk documents. In Part 2, we converted those chunks into embeddings and stored them in a vector database.

Now, in this post, we'll design a reusable `DocumentProcessor` class that puts it all together and becomes the backbone of a Retrieval-Augmented Generation (RAG) chatbot.
## Why a Document Processor Class?
Instead of rewriting the same code every time we load or search a document, we wrap this logic in a class. The `DocumentProcessor` does four key things:
- Loads and splits documents
- Converts chunks into embeddings
- Stores embeddings in a vector database
- Retrieves relevant chunks for a query
This class allows us to cleanly separate the document pipeline from the chatbot logic.
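Before diving in, here is the shape of the class we'll build over the rest of this post (the method bodies are filled in section by section below):

```python
class DocumentProcessor:
    def __init__(self): ...                               # chunking config, embedding model, vector store
    def load_document(self, file_path): ...               # read a file into LangChain Documents
    def process_document(self, file_path): ...            # load, split, embed, and index a document
    def retrieve_relevant_context(self, query, k=3): ...  # fetch the k most relevant chunks
    def reset(self): ...                                  # clear the vector store
```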
## Core Design: Why We Store `chunk_size` and `vectorstore` as State
Here's the constructor:

```python
# Imports used throughout the class (FAISS and the text splitter
# appear in process_document below)
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

class DocumentProcessor:
    def __init__(self):
        self.chunk_size = 1000      # characters per chunk
        self.chunk_overlap = 100    # characters shared between adjacent chunks
        self.embedding_model = OpenAIEmbeddings()
        self.vectorstore = None     # initialized only after the first document is processed
```
### Why store `chunk_size` and `chunk_overlap` in the class?
These values control how documents are split. Keeping them as instance variables gives us:

- Easy configuration (you can change them per instance; see the sketch after this list)
- Consistency across all documents processed by the same object
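One way to make that per-instance configuration explicit is to accept the values as constructor arguments with sensible defaults. This variant isn't part of the original design, just a small sketch of the idea (it reuses the `OpenAIEmbeddings` import from above):

```python
class DocumentProcessor:
    def __init__(self, chunk_size=1000, chunk_overlap=100):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.embedding_model = OpenAIEmbeddings()
        self.vectorstore = None

# Two processors with different chunking strategies:
coarse = DocumentProcessor()                # default 1000-character chunks
fine = DocumentProcessor(chunk_size=300,    # smaller chunks for more precise retrieval
                         chunk_overlap=50)
```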
### Why not initialize `vectorstore` in the constructor?
We delay (or “lazy initialize”) the vector store because:
- We don’t have any documents yet
- FAISS requires data to create the index
- This lets us build the vector store only when the first document is processed
## Implementation Overview
```python
def process_document(self, file_path):
    # 1. Load the raw document
    docs = self.load_document(file_path)

    # 2. Split it into overlapping chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=self.chunk_size,
        chunk_overlap=self.chunk_overlap,
    )
    split_docs = text_splitter.split_documents(docs)

    # 3. Embed the chunks and index them in FAISS
    if self.vectorstore is None:
        # First document: create the index
        self.vectorstore = FAISS.from_documents(split_docs, self.embedding_model)
    else:
        # Later documents: add to the existing index
        self.vectorstore.add_documents(split_docs)
```
This method handles the full pipeline:

- Loads the document (via the `load_document` helper, sketched just below)
- Splits it into chunks
- Generates embeddings
- Adds them to the FAISS vector store
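One thing `process_document` assumes is a `load_document` helper on the class. Part 1 covered document loading in detail; as a reminder, here is a minimal sketch, assuming PDF and plain-text inputs and LangChain's community loaders:

```python
from langchain_community.document_loaders import PyPDFLoader, TextLoader

def load_document(self, file_path):
    # Choose a loader by file extension (PDF and plain text shown here)
    if file_path.lower().endswith(".pdf"):
        loader = PyPDFLoader(file_path)
    else:
        loader = TextLoader(file_path)
    return loader.load()  # a list of LangChain Document objects
```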
## Retrieval Method
```python
def retrieve_relevant_context(self, query, k=3):
    # Nothing has been indexed yet, so there is nothing to retrieve
    if self.vectorstore is None:
        return []
    # Return the k chunks most similar to the query
    return self.vectorstore.similarity_search(query, k=k)
```
This method lets the chatbot fetch relevant document chunks when answering a question.
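If you also want to know how strong each match is, LangChain's FAISS vector store exposes `similarity_search_with_score`. A possible companion method (an optional addition, not part of the class above):

```python
def retrieve_with_scores(self, query, k=3):
    if self.vectorstore is None:
        return []
    # Returns (Document, score) pairs; FAISS scores are distances,
    # so lower means more similar
    return self.vectorstore.similarity_search_with_score(query, k=k)
```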
## Reset Method
```python
def reset(self):
    # Drop the index; the next process_document call rebuilds it from scratch
    self.vectorstore = None
```
This is useful when you want to clear everything and start fresh, say, when loading a new document set.
## Who Is the Client of This Class?
The chatbot system is the client. It uses `DocumentProcessor` to:
- Process and embed documents
- Retrieve relevant context for a query
- Inject that context into a prompt for a language model (like GPT-4)
## Full Example Usage
```python
from document_processor import DocumentProcessor
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

# Step 1: Initialize the document processor
processor = DocumentProcessor()

# Step 2: Process a PDF
file_path = "your_file.pdf"
processor.process_document(file_path)

# Step 3: Ask a question
query = "What is the theme of this book?"
relevant_docs = processor.retrieve_relevant_context(query)

# Step 4: Build a prompt from the retrieved chunks
context = "\n\n".join(doc.page_content for doc in relevant_docs)
prompt_template = ChatPromptTemplate.from_template(
    "Answer the following question based on the provided context.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)
prompt = prompt_template.format(context=context, question=query)

# Step 5: Get the response
chat = ChatOpenAI()
response = chat.invoke(prompt)

print(f"Q: {query}")
print(f"A: {response.content}")
```
## Recap
- We created a `DocumentProcessor` class to handle loading, chunking, embedding, and retrieval
- We stored `chunk_size`, `chunk_overlap`, and `vectorstore` as state for clean, reusable logic
- We showed how the chatbot becomes a client of this processor, retrieving context to build prompts
## Coming Next: Part 4 – Building the Complete CLI Chatbot
In the final post, we’ll build the full CLI chatbot experience:
- Accept user input
- Retrieve matching context
- Generate and stream model responses
- Maintain history and state across turns
Stay tuned for the grand finale.