Building RAG Applications with Python: Complete Guide

The “Lost in the Middle” paper (Liu et al., 2023) confirmed what many practitioners already suspected: language models struggle with long contexts. Even Gemini’s 2-million-token context window doesn’t guarantee the model will use information buried in the middle of a prompt. The model’s attention degrades as context grows, leading to hallucinations and missed facts.

RAG (Retrieval Augmented Generation) is a pipeline that combines information retrieval with text generation. Instead of stuffing everything into a prompt, you retrieve only the relevant chunks from a knowledge base and feed those to the LLM. This grounds the model’s responses in actual data rather than letting it fabricate answers.

You’ll learn how to build production-ready RAG systems in Python, from basic document retrieval to advanced patterns like hybrid search and re-ranking. The code examples use LangChain and local embeddings, so you can test everything without API keys.

The problem with vanilla LLMs

Large language models have a fundamental limitation: they only know what was in their training data. Ask GPT-4 about your company’s internal documentation, and it will hallucinate plausible-sounding nonsense. The model has no access to information outside its training cutoff date.

You might think the solution is simple: paste your documentation into the prompt. This works for small documents, but breaks down quickly. Context windows have hard limits (GPT-4 Turbo supports 128k tokens, Claude 3.5 Sonnet supports 200k), and even within those limits, performance degrades.

The “Lost in the Middle” paper tested this empirically. Researchers placed a relevant fact at different positions in a long context and measured whether the model could retrieve it. Performance dropped dramatically for facts in the middle of the context, even though they were technically within the window. The model’s attention mechanism doesn’t distribute evenly across long inputs.

Cost is another factor. OpenAI charges per token, both input and output. A 100k token context costs $1.00 per request with GPT-4 Turbo. If you’re building a chatbot that answers questions about documentation, this adds up fast. You need a way to send only the relevant information, not the entire knowledge base.

Hallucination remains the core issue. Without grounding in factual data, LLMs will confidently generate incorrect information. They’re trained to predict plausible next tokens, not to verify truth. RAG solves this by retrieving actual documents and instructing the model to answer based only on the provided context.

What is RAG?

Retrieval Augmented Generation is a two-stage pipeline: retrieval followed by generation. You maintain a knowledge base of documents (your company’s docs, research papers, support tickets, whatever). When a user asks a question, you retrieve the most relevant documents and pass them to an LLM along with the question. The LLM generates an answer based on the retrieved context.

The original RAG paper (Lewis et al., 2020) introduced this architecture for knowledge-intensive NLP tasks. The authors combined a dense passage retriever with a seq2seq generator, showing that retrieval improved performance on open-domain question answering compared to models that relied solely on parametric knowledge.

RAG systems typically follow this flow:

  1. Ingestion: Split documents into chunks, generate embeddings, store in a vector database
  2. Retrieval: Convert the user’s question into an embedding, find similar chunks
  3. Generation: Pass the question and retrieved chunks to an LLM, get an answer

The key insight is that you don’t need to fine-tune the LLM on your data. You just retrieve relevant information at query time and include it in the prompt. This makes RAG much cheaper and faster to implement than fine-tuning, especially when your knowledge base changes frequently.

Real-world use cases include documentation Q&A bots, customer support systems, research assistants, and legal document analysis. Any scenario where you need an LLM to answer questions about specific documents benefits from RAG. The generation step can be handled by any capable chat model, such as GPT-4 or Claude.

RAG architecture components

A production RAG system has five core components: document ingestion, text chunking, embedding generation, vector storage, and retrieval mechanisms. Each component has trade-offs you need to understand.

Document ingestion pipeline

You start with raw documents in various formats: PDFs, HTML, Markdown, Word docs. The ingestion pipeline extracts text from these formats and prepares it for chunking. Libraries like pypdf, beautifulsoup4, and python-docx handle format-specific extraction.

The challenge is preserving structure. A PDF might have tables, headers, and multi-column layouts, and naive text extraction produces garbage. You need parsers that understand document structure and convert it to clean text.
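
As a minimal sketch of the extraction step, here’s what loading a PDF with pypdf might look like (the file path is hypothetical); a real pipeline would add loaders for HTML, Markdown, and Word documents behind the same interface:

from pypdf import PdfReader
from langchain_core.documents import Document

# Hypothetical path; swap in your own files
reader = PdfReader("docs/internal_handbook.pdf")

# Keep one Document per page so the page number survives as metadata for citations later
documents = [
    Document(page_content=page.extract_text() or "", metadata={"page": i + 1})
    for i, page in enumerate(reader.pages)
]
print(f"Extracted {len(documents)} pages")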

Text chunking strategies

Chunking splits long documents into smaller pieces that fit within embedding model limits (typically 512 tokens). The chunk size affects retrieval quality. Too large, and you dilute relevance with irrelevant context. Too small, and you lose important surrounding information.

Three common strategies exist:

Fixed-size chunking splits text every N characters or tokens. Simple but crude. It often breaks sentences mid-thought.

Recursive chunking tries to split on natural boundaries (paragraphs, sentences, words) while respecting a maximum size. This is what RecursiveCharacterTextSplitter in LangChain does.

Semantic chunking uses NLP to identify topic boundaries and splits there. More sophisticated but slower.

Overlap between chunks helps preserve context. If chunk 1 ends mid-paragraph and chunk 2 starts at the same paragraph, both chunks contain the full thought. A 10-20% overlap is typical.

Here’s a comparison of chunking strategies:

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Sample Python documentation
documents = [
    Document(page_content="""
    Python's asyncio module provides infrastructure for writing single-threaded 
    concurrent code using coroutines. It was introduced in Python 3.4 and became 
    stable in Python 3.7. The async/await syntax makes asynchronous code more readable.
    """),
    Document(page_content="""
    Python's Global Interpreter Lock (GIL) prevents multiple native threads from 
    executing Python bytecode simultaneously. This means CPU-bound tasks don't 
    benefit from threading. Use multiprocessing for CPU-intensive work.
    """),
]

# Test different chunk sizes
strategies = [
    ("Small (100 chars)", RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)),
    ("Medium (200 chars)", RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=20)),
    ("Large (300 chars)", RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=30)),
]

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
query = "How do I handle CPU-intensive tasks in Python?"

for strategy_name, splitter in strategies:
    chunks = splitter.split_documents(documents)
    vectorstore = FAISS.from_documents(chunks, embeddings)
    results = vectorstore.similarity_search(query, k=1)
    
    print(f"{strategy_name}: {len(chunks)} chunks")
    print(f"Top result: {results[0].page_content[:100]}...")

The medium chunk size (200 characters) balances precision and context. Smaller chunks are more precise but may miss surrounding information. Larger chunks preserve context but dilute relevance.

Embedding generation

Embeddings convert text into dense vectors that capture semantic meaning. Similar texts produce similar vectors. You measure similarity using cosine distance or dot product.
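
To make “similar texts produce similar vectors” concrete, here’s a small sketch using sentence-transformers and NumPy (both assumed installed); the model is the same one used later in this guide:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = [
    "How do I read a file in Python?",
    "Opening and reading text files with Python's open() function.",
    "The capital of France is Paris.",
]
vectors = model.encode(sentences)  # shape (3, 384)

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors[0], vectors[1]))  # high: same topic
print(cosine_similarity(vectors[0], vectors[2]))  # low: unrelated

The first pair scores much higher than the second, which is exactly the signal vector search relies on.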

Two main options exist for generating embeddings:

API-based embeddings (OpenAI’s text-embedding-3-large, Cohere’s embed models) are high-quality but cost money per request. OpenAI charges $0.13 per million tokens for text-embedding-3-large.

Local embeddings (Sentence Transformers models like all-MiniLM-L6-v2) run on your own hardware. They’re free; small models run at reasonable speed on CPU, while larger ones benefit from a GPU. Quality varies by model.

The sentence-transformers/all-MiniLM-L6-v2 model is a good starting point. It produces 384-dimensional vectors and runs fast on CPU. For production, consider larger models like all-mpnet-base-v2 (768 dimensions, better quality) or OpenAI’s embeddings if you need maximum accuracy. Understanding NumPy arrays helps when working with embedding vectors.

Vector storage options

Vector databases store embeddings and enable fast similarity search. Three popular options:

FAISS (Facebook AI Similarity Search) is a library, not a database. It runs in-memory and provides extremely fast search using approximate nearest neighbor algorithms like HNSW (Hierarchical Navigable Small World). Good for prototyping and small datasets (under 1M vectors).

ChromaDB is an embedded vector database. It persists to disk and supports metadata filtering. Easy to set up, no separate server required. Suitable for small to medium datasets.

Pinecone is a managed cloud service. Handles billions of vectors, supports real-time updates, and provides low-latency search. Costs money but scales effortlessly.

For learning RAG, start with FAISS or ChromaDB. For production at scale, use Pinecone or Weaviate.
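
Swapping FAISS for ChromaDB is a small change in LangChain. A sketch, assuming the chromadb package is installed; the persist_directory path is arbitrary:

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
docs = [Document(page_content="Virtual environments isolate project dependencies using venv.")]

# Unlike the in-memory FAISS examples, this index persists to disk between runs
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory="./chroma_db")
results = vectorstore.similarity_search("How do I isolate dependencies?", k=1)
print(results[0].page_content)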

Retrieval mechanisms

The simplest retrieval method is cosine similarity search. You embed the user’s query, find the k nearest vectors in your database, and return those chunks. This works but has limitations.

Maximal Marginal Relevance (MMR) addresses redundancy. If your top 5 results are all from the same document section, they provide duplicate information. MMR balances relevance with diversity, returning results that are both similar to the query and dissimilar to each other.
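
With LangChain’s FAISS store, MMR is a one-line change from plain similarity search. A sketch, assuming a vectorstore built as in this guide’s other examples:

# Fetch a larger candidate pool, then pick k results that are relevant but not redundant
results = vectorstore.max_marginal_relevance_search(
    "How do I handle CPU-intensive tasks in Python?",
    k=3,              # chunks returned
    fetch_k=10,       # candidate pool considered for diversity
    lambda_mult=0.5,  # 1.0 = pure relevance, 0.0 = maximum diversity
)
for doc in results:
    print(doc.page_content[:80])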

Hybrid search combines keyword search (BM25) with vector search. Some queries benefit from exact keyword matching (product names, error codes). Hybrid search runs both methods and merges results.

Metadata filtering narrows search before computing similarity. If your knowledge base includes documents from multiple departments, you can filter to only search engineering docs before running vector search. This improves relevance and reduces compute.

Building a basic RAG system

Here’s a minimal RAG implementation using LangChain and local embeddings. No API keys required:

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Sample documents about Python
documents = [
    Document(page_content="Python is a high-level programming language created by Guido van Rossum in 1991. It emphasizes code readability with significant whitespace."),
    Document(page_content="Python supports multiple programming paradigms including procedural, object-oriented, and functional programming."),
    Document(page_content="The Python Package Index (PyPI) hosts over 500,000 packages for various use cases."),
    Document(page_content="Python 3.12 introduced improved error messages and a new type parameter syntax for generic classes."),
    Document(page_content="Virtual environments in Python isolate project dependencies using tools like venv or virtualenv."),
]

# Step 1: Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20
)
chunks = text_splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(chunks)} chunks")

# Step 2: Create embeddings using local model
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Step 3: Create vector store
vectorstore = FAISS.from_documents(chunks, embeddings)

# Step 4: Perform similarity search
query = "What version of Python has improved error messages?"
results = vectorstore.similarity_search(query, k=2)

print(f"\nQuery: '{query}'")
print("\nTop 2 relevant chunks:")
for i, doc in enumerate(results, 1):
    print(f"\nChunk {i}: {doc.page_content}")

This example demonstrates the core RAG workflow: ingest documents, chunk them, generate embeddings, store in a vector database, and retrieve relevant chunks for a query.

The output shows the retrieval working correctly:

Split 5 documents into 5 chunks
Query: 'What version of Python has improved error messages?'

Top 2 relevant chunks:

Chunk 1: Python 3.12 introduced improved error messages and a new type parameter syntax for generic classes.

Chunk 2: Python is a high-level programming language created by Guido van Rossum in 1991. It emphasizes code readability with significant whitespace.

The first chunk directly answers the question. The second chunk is less relevant but still mentions Python, showing how similarity search works.

In a real RAG system, you would pass these chunks to an LLM along with the query. The LLM would generate an answer based on the retrieved context. Since this example uses only local components, you can test it without API keys.
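
If you do have an API key, the generation step is mostly prompt construction: instruct the model to answer only from the retrieved chunks. A sketch using langchain_openai’s ChatOpenAI (the package and model choice are assumptions; any chat model works), reusing query and results from the example above:

from langchain_openai import ChatOpenAI  # requires an OPENAI_API_KEY environment variable

context = "\n\n".join(doc.page_content for doc in results)
prompt = f"""Answer the question using only the context below.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {query}"""

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
answer = llm.invoke(prompt)
print(answer.content)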

Advanced RAG patterns

Production RAG systems need more sophistication than basic similarity search. Here are patterns that improve quality and reduce costs.

Re-ranking models

The initial retrieval step casts a wide net, returning maybe 20-50 candidate chunks. Re-ranking models score these candidates more carefully and select the top k.

Re-rankers use cross-encoders that process the query and each chunk together, producing a relevance score. This is more accurate than comparing embeddings but too slow to run on millions of vectors. The two-stage approach (fast retrieval, slow re-ranking) balances speed and quality.

Cohere’s Rerank API and open-source models like cross-encoder/ms-marco-MiniLM-L-6-v2 provide re-ranking. This typically improves answer quality by 10-20% at the cost of added latency.
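
Here’s a sketch of the two-stage approach using the open-source cross-encoder mentioned above (sentence-transformers assumed installed), with vectorstore and query carried over from the earlier examples:

from sentence_transformers import CrossEncoder

# Stage 1: cast a wide net with fast vector search
candidates = vectorstore.similarity_search(query, k=20)

# Stage 2: score each (query, chunk) pair jointly with a cross-encoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc.page_content) for doc in candidates])

# Keep the top 5 by re-ranked score
top = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)[:5]
for score, doc in top:
    print(f"{score:.3f}  {doc.page_content[:60]}")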

Conversational RAG with memory

Basic RAG treats each query independently. Conversational RAG maintains context across multiple turns. If a user asks “What is Python?” and then “When was it created?”, the system needs to remember that “it” refers to Python.

LangChain’s ConversationBufferMemory stores chat history and includes it in the prompt. This works but consumes tokens quickly. ConversationSummaryMemory uses an LLM to summarize old messages, reducing token usage while preserving context.

The challenge is balancing memory size with cost. Storing 50 messages in the prompt costs tokens on every request. Summarization adds an extra LLM call. You need to tune this based on your use case.
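
Here’s a framework-free sketch of the idea: keep recent turns in a list and prepend them to the prompt. The retrieve and generate callables are hypothetical placeholders for your own retrieval and LLM calls:

history = []   # list of (user, assistant) turns
MAX_TURNS = 5  # cap how much history each prompt carries

def answer_with_history(question, retrieve, generate):
    context = retrieve(question)
    transcript = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in history[-MAX_TURNS:])
    prompt = (
        f"Conversation so far:\n{transcript}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    answer = generate(prompt)
    history.append((question, answer))
    return answer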

Hybrid search combining keywords and vectors

Vector search excels at semantic similarity but fails on exact matches. If a user searches for “error code E4021”, keyword search (BM25) will find it immediately. Vector search might miss it if the embedding doesn’t capture the specific code.

Hybrid search runs both methods and merges results. Weaviate and Pinecone support this natively. For FAISS or ChromaDB, you implement it yourself using libraries like rank-bm25.

The typical approach: retrieve top 20 from vector search, top 20 from keyword search, merge and deduplicate, re-rank the combined set, return top 5.
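
A sketch of that merge step with the rank-bm25 library (assumed installed), reusing chunks and vectorstore from the earlier examples; a production system would typically fuse scores (for example with reciprocal rank fusion) rather than simply concatenating lists:

from rank_bm25 import BM25Okapi

# Keyword side: BM25 over whitespace-tokenized chunk texts
corpus = [doc.page_content for doc in chunks]
bm25 = BM25Okapi([text.lower().split() for text in corpus])

query = "error code E4021"
keyword_hits = bm25.get_top_n(query.lower().split(), corpus, n=20)

# Vector side: the FAISS store built earlier
vector_hits = [doc.page_content for doc in vectorstore.similarity_search(query, k=20)]

# Merge and deduplicate while preserving order; re-rank this pool, then return the top 5
merged = list(dict.fromkeys(keyword_hits + vector_hits))
print(f"{len(merged)} unique candidates for re-ranking")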

Metadata filtering for multi-tenant systems

If your RAG system serves multiple customers, you need to ensure each customer only retrieves their own documents. Metadata filtering handles this.

When ingesting documents, attach metadata like {"customer_id": "acme-corp", "department": "engineering"}. At query time, filter the vector search to only consider chunks matching the customer ID.

ChromaDB and Pinecone support metadata filtering natively. FAISS requires you to implement it by maintaining separate indices per customer or filtering results post-retrieval.
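
A sketch using ChromaDB’s filter support in LangChain (the second customer ID is made up for illustration), reusing the embeddings object from the earlier examples:

from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

docs = [
    Document(page_content="Acme deployment runbook: deploy via the staging pipeline.",
             metadata={"customer_id": "acme-corp"}),
    Document(page_content="Globex deployment runbook: deploy via the blue/green script.",
             metadata={"customer_id": "globex"}),  # hypothetical second tenant
]
vectorstore = Chroma.from_documents(docs, embeddings)

# Only Acme's chunks are eligible, no matter how similar other tenants' documents are
results = vectorstore.similarity_search(
    "How do we deploy?", k=2, filter={"customer_id": "acme-corp"}
)
print([doc.metadata for doc in results])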

Production considerations

Moving from prototype to production requires handling errors, optimizing costs, and monitoring performance.

Error handling and retries

LLM APIs fail. OpenAI returns 429 (rate limit) or 500 (server error) responses. Your RAG system needs retry logic with exponential backoff.

The tenacity library provides decorators for this:

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
def call_llm(prompt):
    # Your LLM API call here
    pass

This retries up to 3 attempts with exponential backoff, waiting roughly 2 and then 4 seconds between attempts (capped at 10 seconds). Most transient errors resolve within a few retries.

Caching strategies

If multiple users ask the same question, you don’t need to re-run retrieval and generation. Cache the results.

Simple in-memory caching with functools.lru_cache works for single-server deployments. For distributed systems, use Redis or Memcached.

The cache key should include the query and any filters (customer ID, date range). Cache TTL depends on how often your knowledge base updates. For static documentation, cache for hours. For real-time data, cache for minutes.
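
A minimal in-memory sketch of that policy; compute_answer is a hypothetical stand-in for your retrieval-plus-generation call, and the same keying scheme carries over to Redis:

import hashlib
import time

cache = {}         # swap for Redis in a distributed deployment
CACHE_TTL = 3600   # seconds; tune to how often the knowledge base changes

def cached_answer(query, customer_id, compute_answer):
    key = hashlib.sha256(f"{customer_id}:{query}".encode()).hexdigest()
    entry = cache.get(key)
    if entry and time.time() - entry["at"] < CACHE_TTL:
        return entry["answer"]  # cache hit: skip retrieval and generation
    answer = compute_answer(query, customer_id)
    cache[key] = {"answer": answer, "at": time.time()}
    return answer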

Cost optimization

RAG costs come from three sources: embedding generation, vector database queries, and LLM API calls.

Embedding costs scale with document volume. If you’re using OpenAI embeddings at $0.13 per million tokens, a 10GB knowledge base can cost anywhere from tens of dollars to a few hundred to embed, depending on how much of it is extractable text (plain text runs roughly 250 million tokens per gigabyte). Use local embeddings (Sentence Transformers) to eliminate this cost.

Vector database costs depend on the service. Pinecone charges per pod-hour. ChromaDB and FAISS are free but require you to manage infrastructure.

LLM costs scale with prompt size. Retrieving 10 chunks of 500 tokens each adds 5,000 tokens to every request. At $0.01 per 1k tokens (GPT-4 Turbo input), that’s $0.05 per query. Optimize by retrieving fewer, more relevant chunks.

Monitoring and logging

You need visibility into retrieval quality and LLM performance. Log every query, the retrieved chunks, and the generated answer. This lets you debug failures and identify patterns.

Key metrics to track:

  • Retrieval latency: How long does vector search take?
  • LLM latency: How long does generation take?
  • Retrieval relevance: Are the top chunks actually relevant? (requires manual labeling)
  • Answer quality: Does the LLM answer the question correctly? (requires manual review or automated eval)

Tools like LangSmith and Arize provide observability for LLM applications. They capture traces, log prompts and responses, and help you identify bottlenecks.
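
Even before adopting a dedicated tool, a structured log line per query captures the basics. A sketch where retrieve and generate are hypothetical stand-ins for your own pipeline functions:

import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag")

def answer_and_log(query, retrieve, generate):
    t0 = time.time()
    chunks = retrieve(query)
    retrieval_ms = (time.time() - t0) * 1000

    t1 = time.time()
    answer = generate(query, chunks)
    llm_ms = (time.time() - t1) * 1000

    # One JSON log line per query; ship these to your log aggregator or tracing tool
    logger.info(json.dumps({
        "query": query,
        "retrieved": [c.page_content[:80] for c in chunks],
        "retrieval_ms": round(retrieval_ms, 1),
        "llm_ms": round(llm_ms, 1),
        "answer": str(answer)[:200],
    }))
    return answer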

Real-world example: Documentation Q&A bot

Here’s a complete example of a RAG system for answering questions about Python documentation:

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Simulate loading Python docs
python_docs = [
    Document(page_content="The asyncio module provides infrastructure for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources."),
    Document(page_content="Type hints were introduced in PEP 484. They allow you to specify the expected types of function arguments and return values."),
    Document(page_content="The Global Interpreter Lock (GIL) is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecode at once."),
    Document(page_content="F-strings provide a way to embed expressions inside string literals using curly braces. They were introduced in Python 3.6."),
    Document(page_content="List comprehensions provide a concise way to create lists. They consist of brackets containing an expression followed by a for clause."),
]

# Build RAG system
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=30)
chunks = splitter.split_documents(python_docs)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)

# Query function
def ask_question(question):
    results = vectorstore.similarity_search(question, k=2)
    
    # In production, you'd pass results to an LLM here
    # For this example, we just return the retrieved chunks
    print(f"Question: {question}\n")
    print("Retrieved context:")
    for i, doc in enumerate(results, 1):
        print(f"{i}. {doc.page_content}\n")

# Test queries
ask_question("How do I write concurrent code in Python?")
ask_question("What are f-strings?")
ask_question("Why doesn't threading help with CPU-bound tasks?")

This example shows the full pipeline: load documents, chunk them, create embeddings, store in FAISS, and retrieve relevant chunks for queries.

In a production system, you would:

  1. Load documents from a real source (file system, database, API)
  2. Use a production vector database (Pinecone, Weaviate)
  3. Pass retrieved chunks to an LLM (OpenAI, Anthropic)
  4. Add error handling and retries
  5. Implement caching
  6. Log queries and responses for monitoring

Common pitfalls and solutions

RAG systems fail in predictable ways. Here’s what to watch for.

Chunk size too large or too small

Large chunks (1000+ tokens) preserve context but dilute relevance. The retrieval step might return a chunk containing the answer buried in irrelevant text. The LLM then has to find the needle in the haystack.

Small chunks (50-100 tokens) are precise but lose surrounding context. A chunk might say “It was released in 1991” without mentioning what “it” refers to.

Solution: Start with 200-500 tokens per chunk with 10-20% overlap. Test different sizes on your specific data and measure retrieval quality.

Poor embedding model choice

The embedding model determines retrieval quality. A model trained on general web text might perform poorly on domain-specific content (legal documents, medical records, code).

Solution: Use domain-specific embedding models when available. For code, try microsoft/codebert-base. For scientific text, try allenai/scibert. For general use, text-embedding-3-large from OpenAI provides strong performance.

No re-ranking

Similarity search returns approximate results. The top 5 chunks might not be the actual top 5 most relevant chunks. Re-ranking improves precision.

Solution: Retrieve 20-50 candidates with fast vector search, then re-rank using a cross-encoder. This two-stage approach balances speed and accuracy.

Ignoring metadata

If your knowledge base contains documents from different sources, time periods, or departments, metadata filtering improves relevance.

Solution: Attach metadata during ingestion ({"source": "engineering_docs", "date": "2024-01"}) and filter at query time. This reduces the search space and improves results.

Frequently asked questions

What’s the difference between RAG and fine-tuning?

Fine-tuning trains an LLM on your data, updating its weights. This teaches the model new patterns but doesn’t give it access to specific facts. RAG retrieves facts at query time without modifying the model. Use RAG when your data changes frequently or when you need citations. Use fine-tuning when you want to teach the model a new style or domain-specific reasoning.

Which vector database should I use?

For prototyping, use FAISS or ChromaDB. They’re free and easy to set up. For production at scale (millions of vectors, multiple users), use Pinecone or Weaviate. They handle infrastructure, scaling, and high availability.

How do I choose chunk size?

Test different sizes on your data. Retrieve chunks for sample queries and manually evaluate relevance. Typical range is 200-500 tokens. Technical documentation benefits from larger chunks (400-600 tokens) to preserve context. Chat logs or short-form content works better with smaller chunks (100-200 tokens).

Can RAG work with local LLMs?

Yes. Use Ollama to run Llama 3, Mistral, or other open-source models locally. The RAG pipeline is the same: retrieve chunks, pass them to the LLM. Local LLMs are slower and lower quality than GPT-4, but they’re free and private.
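
A sketch of the generation step against a local model via the ollama Python client (assumes the Ollama server is running and the model has been pulled), reusing query and results from the earlier examples:

import ollama

context = "\n\n".join(doc.page_content for doc in results)
prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"

response = ollama.chat(
    model="llama3",  # or mistral, or any other model you've pulled locally
    messages=[{"role": "user", "content": prompt}],
)
print(response["message"]["content"])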

What’s the cost of running RAG in production?

Costs depend on scale. A small system (10k documents, 100 queries/day) might cost $20-50/month using OpenAI embeddings and GPT-4 Turbo. A large system (1M documents, 10k queries/day) could cost $500-2000/month. Use local embeddings and caching to reduce costs.

Conclusion

RAG solves the fundamental problem of grounding LLM responses in factual data. Instead of relying on the model’s parametric knowledge, you retrieve relevant documents and include them in the prompt. This reduces hallucinations, handles knowledge cutoff dates, and works with private data.

The core pipeline is straightforward: chunk documents, generate embeddings, store in a vector database, retrieve relevant chunks, and pass them to an LLM. Production systems add re-ranking, metadata filtering, caching, and monitoring.

Start with the basic examples in this guide using LangChain and local embeddings. Once you understand the architecture, move to production-grade components like Pinecone for vector storage and GPT-4 for generation. Test different chunking strategies and embedding models on your specific data.

RAG is not a silver bullet. It adds latency, costs money, and requires tuning. But for applications that need factual accuracy and access to private data, it’s the most practical approach available in 2026.
