Document Processing and Retrieval with LangChain in Python

Khalid Rizvi

Welcome to the first article in our 4-part series on building intelligent document systems using LangChain and Python.

In this tutorial, you’ll learn how to:

  • Load documents from PDFs, text files, and other formats
  • Break them into manageable chunks
  • Prepare them for embeddings and search
  • Lay the foundation for retrieval-augmented generation (RAG) systems

Why Document Processing Matters

Document processing is the first step in making your documents useful to AI systems. Whether you’re building a search engine, chatbot, or summarizer, you need to:

  1. Load the content correctly
  2. Break it into chunks
  3. Add meaning (via embeddings)
  4. Search through it when someone asks a question

In this post, we’ll cover steps 1 and 2: loading and chunking.


Getting Started: What You Need

Install the following in your Python environment:

uv pip install langchain langchain-community langchain-openai langchain-ollama pypdf

This installs LangChain, the community loaders used below, the OpenAI and Ollama integrations we’ll use later in the series, and pypdf, which extracts text from PDFs. Note: this won’t work on scanned PDFs unless you also use OCR tools like Tesseract (not covered here).


Step 1: Load Documents with LangChain

LangChain provides loaders to read content from different file types. Below are some examples:

Load a PDF

from langchain_community.document_loaders import PyPDFLoader

pdf_loader = PyPDFLoader("document.pdf")
docs = pdf_loader.load()   # one Document per PDF page

Load a Text File

from langchain_community.document_loaders import TextLoader

text_loader = TextLoader("document.txt")
docs = text_loader.load()

Load a DOCX or Other Format

from langchain_community.document_loaders import UnstructuredFileLoader

general_loader = UnstructuredFileLoader("document.docx")
docs = general_loader.load()

Note: this loader depends on the separate unstructured package (uv pip install unstructured); some formats, including DOCX, may need its extras.

LangChain also supports loaders for CSV, JSON, HTML, and web pages.
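
For example, a CSV file or a web page loads in the same way (a minimal sketch: the file name and URL are placeholders, and WebBaseLoader needs beautifulsoup4 installed):

from langchain_community.document_loaders import CSVLoader, WebBaseLoader

# Each CSV row becomes one Document by default
csv_docs = CSVLoader("data.csv").load()

# Fetches the page and extracts its visible text
web_docs = WebBaseLoader("https://example.com").load()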


Step 2: Inspect the Document

Once you load the file, you can view how it’s structured.

print(f"Loaded {len(docs)} chunks")
print(docs[0].page_content[:200])     # First 200 characters
print(docs[0].metadata)               # Source info, page number, etc.

Each loaded document comes with metadata like:

  • File path
  • Page number
  • Creation date
  • Total pages

This helps you trace the content later and maintain structure.
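
For documents loaded with PyPDFLoader, a minimal sketch that traces each page back to its origin might look like this ('source' and 'page' are the keys PyPDFLoader sets; other loaders use different keys):

for doc in docs:
    # print where each Document came from and which page it was
    print(doc.metadata.get("source"), "page", doc.metadata.get("page"))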


Step 3: Why Splitting Is Needed

Large documents exceed the context window of most AI models, so they can’t be processed in one go. We solve this by splitting the text into smaller chunks.

For example:

  • Chatbots need small chunks for precise answers
  • Summarizers need longer chunks to preserve meaning

Step 4: Split with RecursiveCharacterTextSplitter

LangChain provides several splitters. Here’s a reliable one:

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # maximum characters per chunk
    chunk_overlap=100   # characters shared between consecutive chunks
)

split_docs = text_splitter.split_documents(docs)

This means:

  • Each chunk is at most 1000 characters
  • Consecutive chunks overlap by up to 100 characters (helps keep context)
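
To verify the effect, print the counts and a preview (a minimal sketch; your numbers will differ):

print(f"Loaded {len(docs)} documents")
print(f"After splitting: {len(split_docs)} chunks")
print("\nFirst chunk:")
print(split_docs[0].page_content[:200])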

Example Output

Loaded 8 documents
After splitting: 20 chunks

First chunk:
KHALID RIZVI
Solutions Architect – Generative AI & Cloud
...

Now you have multiple, manageable chunks—ready for embedding and retrieval.


How Chunking Works (Explained Simply)

AI models can only read a limited amount of text at a time. So, we break long documents into smaller parts called “chunks.”

Chunk Size

  • Too small? You lose meaning.
  • Too large? Chunks can exceed the model’s context limit, and retrieval gets less precise.
  • Sweet spot: 500–1000 characters.

Overlap

  • A bit of repeated text between chunks helps preserve meaning.
  • Helps when sentences span across chunks.
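
A toy example makes this concrete (a minimal sketch; the tiny sizes are for illustration only):

from langchain_text_splitters import RecursiveCharacterTextSplitter

toy_splitter = RecursiveCharacterTextSplitter(
    chunk_size=40,    # far smaller than you would use in practice
    chunk_overlap=10
)
chunks = toy_splitter.split_text(
    "LangChain prefers to split at paragraph, sentence, or word "
    "boundaries before resorting to cutting mid-word."
)
for chunk in chunks:
    # each chunk is at most 40 characters; neighbors may share up to 10
    print(repr(chunk))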

Real-World Example

Suppose you’re building a resume chatbot. If Khalid’s AWS experience is split across chunks and there’s no overlap, the bot may miss it.

Overlap ensures it captures:

  • “AWS Lambda, API Gateway…”
  • “…and S3, DynamoDB, and CloudFormation”

Together, these give a full picture.


Summary

In this post, you learned how to:

  • Load PDFs and text files using LangChain
  • Inspect content and metadata
  • Split large documents into smaller chunks

This prepares your documents for vector search, embeddings, and retrieval systems like RAG.


Coming Next: Embedding and Vector Storage

In Part 2, we’ll explore how to turn these chunks into embeddings and store them in a vector database like FAISS or Chroma.

Stay tuned.
