Document Processing and Retrieval with LangChain in Python#
Welcome to the first article in our 4-part series on building intelligent document systems using LangChain and Python.
In this tutorial, you’ll learn how to:
- Load documents from PDFs, text files, and other formats
- Break them into manageable chunks
- Prepare them for embeddings and search
- Lay the foundation for retrieval-augmented generation (RAG) systems
Why Document Processing Matters#
Document processing is the first step in making your documents useful to AI systems. Whether you’re building a search engine, chatbot, or summarizer, you need to:
- Load the content correctly
- Break it into chunks
- Add meaning (via embeddings)
- Search through it when someone asks a question
In this post, we’ll cover steps 1 and 2: loading and chunking.
Getting Started: What You Need#
Install the following in your Python environment:
uv pip install langchain langchain-community langchain-openai langchain-ollama pypdf
This installs LangChain, the community package that provides the document loaders used below, and pypdf, which extracts text from PDFs. Note: this won’t work on scanned PDFs unless you also use OCR tools like Tesseract (not covered here).
Step 1: Load Documents with LangChain#
LangChain provides loaders to read content from different file types. Below are some examples:
Load a PDF#
from langchain_community.document_loaders import PyPDFLoader
pdf_loader = PyPDFLoader("document.pdf")
docs = pdf_loader.load()
Load a Text File#
from langchain_community.document_loaders import TextLoader
text_loader = TextLoader("document.txt")
docs = text_loader.load()
Load a DOCX or Other Format#
from langchain_community.document_loaders import UnstructuredFileLoader
general_loader = UnstructuredFileLoader("document.docx")
docs = general_loader.load()
This loader depends on the unstructured package, which isn’t in the install line above (uv pip install unstructured).
LangChain also supports loaders for CSV, JSON, HTML, and web pages.
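For example, a CSV file and a web page load the same way (a minimal sketch; the file name and URL are placeholders, and WebBaseLoader additionally needs the beautifulsoup4 package):
from langchain_community.document_loaders import CSVLoader, WebBaseLoader

# One Document per CSV row, with the row number recorded in metadata
csv_docs = CSVLoader("data.csv").load()

# Fetches the page and strips the HTML down to plain text
web_docs = WebBaseLoader("https://example.com").load()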
Step 2: Inspect the Document#
Once you load the file, you can view how it’s structured.
print(f"Loaded {len(docs)} chunks")
print(docs[0].page_content[:200]) # First 200 characters
print(docs[0].metadata) # Source info, page number, etc.
Each loaded document comes with metadata like:
- File path
- Page number
- Creation date
- Total pages
This helps you trace the content later and maintain structure.
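You can see this in action by tracing each document back to its source (a quick sketch over the docs loaded above):
# Map each loaded document back to its source file and page
for doc in docs:
    source = doc.metadata.get("source", "unknown")
    page = doc.metadata.get("page", "?")
    print(f"{source} (page {page}): {len(doc.page_content)} characters")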
Step 3: Why Splitting Is Needed#
Most AI models can only process a limited amount of text at once (their context window), so a large document won’t fit in a single request. We solve this by splitting the text into smaller chunks.
For example:
- Chatbots need small chunks for precise answers
- Summarizers need longer chunks to preserve meaning
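In code, that difference is just configuration (a sketch; the sizes are illustrative starting points, not hard rules):
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Smaller chunks: precise retrieval for question answering
qa_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

# Larger chunks: more surrounding context for summarization
summary_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)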
Step 4: Split with RecursiveCharacterTextSplitter#
LangChain provides several splitters. Here’s a reliable one:
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=100
)
split_docs = text_splitter.split_documents(docs)
This means:
- Each chunk is at most 1,000 characters
- Consecutive chunks share up to 100 characters of overlap (helps preserve context across chunk boundaries)
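You can sanity-check the result (a quick sketch over the split_docs produced above; the exact counts depend on your document):
# Each chunk is a Document capped at chunk_size characters
print(f"After splitting: {len(split_docs)} chunks")
for i, chunk in enumerate(split_docs[:3]):
    print(f"Chunk {i}: {len(chunk.page_content)} characters")

# The tail of one chunk roughly reappears at the head of the next
# (when both chunks come from the same page)
print(split_docs[0].page_content[-100:])
print(split_docs[1].page_content[:100])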
Example Output#
Loaded 8 documents
After splitting: 20 chunks
First chunk:
KHALID RIZVI
Solutions Architect – Generative AI & Cloud
...
Now you have multiple, manageable chunks—ready for embedding and retrieval.
How Chunking Works (Explained Simply)#
AI models can only read a limited amount of text at a time. So, we break long documents into smaller parts called “chunks.”
Chunk Size#
- Too small? You lose meaning.
- Too large? Retrieval gets less precise, and you risk exceeding the model’s limit.
- Sweet spot: 500–1000 characters.
Overlap#
- A bit of repeated text between chunks helps preserve meaning.
- Helps when sentences span across chunks.
Real-World Example#
Suppose you’re building a resume chatbot. If Khalid’s AWS experience is split across chunks and there’s no overlap, the bot may miss it.
Overlap ensures it captures:
- “AWS Lambda, API Gateway…”
- “…and S3, DynamoDB, and CloudFormation”
Together, these give a full picture.
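Here’s that idea in miniature (a sketch with a made-up sentence and deliberately tiny chunk sizes so the overlap is easy to see):
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "Khalid built serverless APIs with AWS Lambda, API Gateway, S3, DynamoDB, and CloudFormation."

# Tiny chunks so the repeated text is visible in the output
demo_splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=20)
for chunk in demo_splitter.split_text(text):
    print(repr(chunk))
Each printed chunk repeats a few words from the one before it, so a question about any of these AWS services can match at least one complete chunk.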
Summary#
In this post, you learned how to:
- Load PDFs and text files using LangChain
- Inspect content and metadata
- Split large documents into smaller chunks
This prepares your documents for vector search, embeddings, and retrieval systems like RAG.
Coming Next: Embedding and Vector Storage#
In Part 2, we’ll explore how to turn these chunks into embeddings and store them in a vector database like FAISS or Chroma.
Stay tuned.