The original RAG paper by Lewis et al. (2020) laid the foundation for retrieval-augmented generation architectures. RAG is the most common production LLM pattern for a reason: it addresses the two biggest problems with language models. Hallucination drops because the model generates answers from retrieved documents rather than memorized training data. Knowledge stays current because you update the document corpus, not the model weights.
But most RAG systems underperform. Retrieval returns the wrong documents. Chunking splits context in the wrong places. Generation ignores retrieved content or hallucinates anyway. The gap between a tutorial RAG demo and a production RAG system is substantial.
This guide covers the full architecture, the tools that work, and the pitfalls that cause most systems to fail.
RAG Architecture Overview
A production RAG system has five components. Each one has architectural decisions that affect the whole system's quality.
1. Document Processing
Raw documents (PDFs, web pages, databases, emails, Slack messages) need to be converted into clean text chunks that a retrieval system can search.
Document parsing: Extract text from source formats while preserving structure.
- PDFs: Unstructured, PyMuPDF, or Amazon Textract for complex layouts
- Web pages: Beautiful Soup, Scrapy, or Firecrawl for full site crawling
- Office documents: python-docx, openpyxl, python-pptx
- Databases: SQL queries with text field extraction
- Multi-modal: Document AI (Google), Azure Form Recognizer for documents with tables and images
2. Chunking
Chunking is how you split documents into pieces that the retrieval system searches. This decision has outsized impact on overall RAG quality. Bad chunking is the single most common cause of poor RAG performance.
Fixed-size chunking: Split text every N tokens with M token overlap.
- Simple and predictable
- Good default: 512 tokens with 50-100 token overlap
- Problem: splits can land mid-paragraph or mid-sentence, breaking context

Semantic chunking: Split at natural boundaries (sentences, paragraphs, topic shifts).
- Preserves meaning better than fixed-size
- Requires more processing (sentence boundary detection, topic modeling)
- Can produce variable-size chunks (some too small, some too large)

Hierarchical chunking: Store chunks at multiple granularities (e.g., section, paragraph, sentence).
- Enables retrieval at the right granularity for each query
- More complex to implement and store
- Best for documents with clear hierarchical structure

Chunk size tradeoffs:
- Smaller chunks (128-256 tokens): Better precision (retrieved chunk is highly relevant), but may miss surrounding context
- Larger chunks (512-1024 tokens): More context per retrieval, but lower precision (may include irrelevant text)
- Very large chunks (1024+ tokens): Risk retrieving mostly irrelevant text with a small relevant section
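The fixed-size strategy above can be sketched in a few lines. Token counting is approximated here with a pre-tokenized list; a real pipeline would count tokens with the embedding model's own tokenizer (e.g. tiktoken), and the function name is illustrative:

```python
def chunk_fixed(tokens, chunk_size=512, overlap=100):
    """Split a token list into fixed-size chunks with overlap.

    Overlap means each chunk repeats the tail of the previous one,
    so a sentence cut at a boundary still appears whole somewhere.
    """
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already covered the tail
    return chunks
```

With the 512/100 default, a 1,000-token document yields three chunks whose boundaries overlap by 100 tokens.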
3. Embedding Generation
Convert text chunks into numerical vectors that capture semantic meaning. Similar texts produce similar vectors, enabling semantic search.
Model selection:
- OpenAI text-embedding-3-small (1536 dim): Good default, $0.02/1M tokens
- OpenAI text-embedding-3-large (3072 dim): Higher quality, 2x cost
- Cohere embed-v3 (1024 dim): Best multilingual performance
- BGE-M3 (open-source, 1024 dim): Best free option, runs locally
- Nomic embed-text-v1.5 (768 dim): Efficient open-source option
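Whichever model you pick, semantic search ultimately compares these vectors, most commonly with cosine similarity. A minimal illustration with toy 3-dimensional vectors (real models produce 768-3072 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": the query points mostly along the first axis.
query = [0.9, 0.1, 0.0]
doc_close = [0.8, 0.2, 0.1]  # semantically similar document
doc_far = [0.0, 0.1, 0.9]    # unrelated document
assert cosine_similarity(query, doc_close) > cosine_similarity(query, doc_far)
```

Vector databases implement exactly this comparison, but with approximate nearest-neighbor indexes so it scales to millions of chunks.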
4. Retrieval
Given a user query, find the most relevant chunks from your corpus. This is where most RAG systems succeed or fail.
Semantic search: Convert the query to a vector, search for the nearest vectors in your database. Finds conceptually similar content even when exact keywords don't match.

Keyword search (BM25): Traditional text search. Finds exact keyword matches. Important for queries with specific terms, names, identifiers, or codes that semantic search may not capture.

Hybrid search: Combine semantic and keyword search results. This is the recommended approach for production systems. Hybrid search improves recall by 20-30% over semantic search alone. Implementation: run both searches, normalize scores to the same range, and combine with a weighted average (typically 0.7 semantic + 0.3 keyword, but tune for your data).
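The normalize-and-combine step can be sketched as follows. Min-max normalization is one common choice here, and the function names are illustrative rather than from any particular library:

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1] so the two lists are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_merge(semantic, keyword, w_semantic=0.7):
    """Weighted fusion of semantic and BM25 scores (the 0.7/0.3 default)."""
    sem, kw = min_max(semantic), min_max(keyword)
    docs = set(sem) | set(kw)
    fused = {d: w_semantic * sem.get(d, 0.0) + (1 - w_semantic) * kw.get(d, 0.0)
             for d in docs}
    # Return doc ids ranked by fused score, best first.
    return sorted(fused, key=fused.get, reverse=True)
```

For example, a document that scores well on both signals outranks one that wins only on semantic similarity: `hybrid_merge({"A": 0.91, "B": 0.88, "C": 0.40}, {"A": 2.1, "B": 7.5, "C": 6.0})` ranks B first.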
Reranking: After initial retrieval, use a cross-encoder model to re-score results. Cross-encoders are more accurate than embedding similarity but too slow for initial retrieval on large corpora. Use them as a second stage on the top 20-50 results. Reranking models: Cohere Rerank, bge-reranker, ColBERT.
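The two-stage pattern looks roughly like this; `first_stage` and `cross_encoder_score` are hypothetical stand-ins for your retriever and for a reranker such as Cohere Rerank or bge-reranker:

```python
def two_stage_retrieve(query, first_stage, cross_encoder_score, final_k=5):
    """Cheap first-stage retrieval of ~50 candidates, then cross-encoder reranking.

    first_stage(query, k) -> list of candidate chunks.
    cross_encoder_score(query, chunk) -> relevance score (higher = better).
    """
    candidates = first_stage(query, k=50)
    # Re-score only the shortlist: cross-encoders are accurate but slow.
    reranked = sorted(candidates,
                      key=lambda chunk: cross_encoder_score(query, chunk),
                      reverse=True)
    return reranked[:final_k]
```

The key design point is that the expensive model never sees the full corpus, only the shortlist the cheap retriever produced.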
Retrieval parameters:
- Top-K: Number of chunks to retrieve. Start with 5-10. More isn't always better. Irrelevant chunks dilute the context and can confuse the generator.
- Similarity threshold: Minimum similarity score to include a result. Prevents low-relevance chunks from entering the context.
- Maximum context length: Total tokens sent to the generator. Balance between providing enough context and staying within the model's effective context window.
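Applied together, the three parameters above amount to a filter over the ranked results. A sketch with illustrative defaults (the function name and tuple layout are assumptions, not a library API):

```python
def select_context(results, top_k=5, min_score=0.75, max_tokens=3000):
    """Apply Top-K, a similarity threshold, and a context budget.

    results: list of (chunk_text, score, token_count) tuples,
    sorted by score descending.
    """
    selected, budget = [], max_tokens
    for text, score, tokens in results[:top_k]:   # Top-K cutoff
        if score < min_score:
            continue                              # similarity threshold
        if tokens > budget:
            continue                              # context length budget
        selected.append(text)
        budget -= tokens
    return selected
```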
5. Generation
The LLM generates an answer based on the retrieved context and the user's query.
Prompt structure:

```
System: You are a helpful assistant. Answer questions based on the provided context. If the context doesn't contain enough information to answer the question, say so.

Context:
[Retrieved chunk 1]
[Retrieved chunk 2]
[Retrieved chunk 3]

User: [Query]
```
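Assembling that template programmatically might look like the following; the message format assumes an OpenAI-style chat API, and the function name is illustrative:

```python
def build_prompt(query, chunks):
    """Build a grounded chat prompt from retrieved chunks."""
    system = (
        "You are a helpful assistant. Answer questions based on the provided "
        "context. If the context doesn't contain enough information to answer "
        "the question, say so."
    )
    # Number the chunks so the model can cite them as [1], [2], ...
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return [
        {"role": "system", "content": f"{system}\n\nContext:\n{context}"},
        {"role": "user", "content": query},
    ]
```

Numbering the chunks also makes citation extraction trivial later: the model can reference `[2]` and you can map that back to a source document.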
Key prompt design decisions:
- Always instruct the model to use only the provided context (reduces hallucination)
- Tell the model what to do when context is insufficient (say "I don't know" rather than guess)
- Request citations or source references if needed
- Specify output format (paragraph, bullet points, structured data)
Generator model selection:
- GPT-4o: Highest quality, most expensive ($2.50/$10.00 per 1M input/output tokens)
- Claude 3.5 Sonnet: Strong quality, good at following instructions ($3.00/$15.00)
- GPT-4o-mini: Good quality at 10x lower cost ($0.15/$0.60)
- Open-source 7B models: Lowest cost per query when self-hosted, adequate for many tasks
Evaluation Framework
Without systematic evaluation, you're guessing at RAG quality. Build an evaluation pipeline before optimizing anything else.
Retrieval Evaluation
Retrieval precision: Of the chunks retrieved, how many are relevant? Measure by having domain experts label query-chunk pairs.

Retrieval recall: Of all relevant chunks in the corpus, how many were retrieved? Requires a ground truth set of relevant documents for each query.

Mean Reciprocal Rank (MRR): Where does the first relevant chunk appear in the ranked results? Higher is better.

Generation Evaluation
Faithfulness: Does the answer stick to information in the retrieved context, or does it hallucinate? The most critical metric for production RAG.

Answer relevance: Does the answer address the question? A faithful answer that's off-topic is still useless.

Completeness: Does the answer cover all aspects of the question using the available context?

Automated Evaluation Tools
Ragas: The most popular RAG evaluation framework. Provides automated metrics for faithfulness, relevance, and context precision. Good for continuous monitoring but should be supplemented with human evaluation.

DeepEval: General LLM evaluation framework with RAG-specific metrics.

Custom evaluation: Build your own evaluation set of 50-100 questions with ground truth answers. Run it after every change to the RAG pipeline. This is the most reliable approach but requires upfront investment.

Evaluation Cadence
- After every architectural change (new embedding model, chunking strategy, retrieval method): Full evaluation run
- Weekly: Automated metrics on a sample of production queries
- Monthly: Human evaluation on 50-100 queries covering the full range of topics
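A custom evaluation harness can start very small. The sketch below scores answers by keyword presence, which is a crude stand-in for graded relevance but catches regressions cheaply; all names here are illustrative:

```python
def evaluate(rag_answer, eval_set):
    """Run a RAG pipeline over an evaluation set and report pass rate.

    rag_answer(question) -> str  is your pipeline, passed in as a callable.
    eval_set is a list of (question, expected_keywords) pairs; a question
    passes if every expected keyword appears in the answer.
    """
    passed = 0
    for question, keywords in eval_set:
        answer = rag_answer(question).lower()
        if all(k.lower() in answer for k in keywords):
            passed += 1
    return passed / len(eval_set)
```

Run this after every pipeline change and track the score over time; a drop tells you exactly which change hurt quality.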
Production Architecture
Basic RAG (MVP)
User query > Embed query > Vector search > Top-K chunks > LLM generation > Response
Components: One embedding model, one vector database, one LLM. This handles 80% of RAG use cases adequately.
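The MVP pipeline fits in one function once the components are abstracted; `embed`, `search`, and `generate` below are hypothetical stand-ins for your embedding model, vector database client, and LLM call:

```python
def answer(query, embed, search, generate, top_k=5):
    """Minimal RAG loop: embed query -> vector search -> grounded generation."""
    query_vec = embed(query)                 # query embedding
    chunks = search(query_vec, top_k=top_k)  # nearest chunks from the vector DB
    context = "\n\n".join(chunks)
    prompt = f"Answer from this context only:\n{context}\n\nQuestion: {query}"
    return generate(prompt)                  # LLM call
```

Everything in the Advanced and Agentic variants below is an elaboration of one of these four steps, which is why getting the basic loop instrumented and evaluated first pays off.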
Advanced RAG
User query > Query analysis (intent, entities) > Hybrid retrieval (semantic + keyword) > Reranking > Context compression > LLM generation with citation extraction > Response with sources
Additional components: Query classifier, BM25 index, reranker model, context compression module, citation extractor.
Agentic RAG
User query > Planning (decompose into sub-queries) > Multi-step retrieval (different sources per sub-query) > Synthesis (combine sub-answers) > Verification (check consistency) > Response
This pattern handles complex queries that require information from multiple sources. It's slower and more expensive but handles questions that basic RAG can't.
Common Tools and Stack Recommendations
Orchestration
- LangChain: Most popular, extensive ecosystem, good for prototyping. Can be verbose for simple use cases.
- LlamaIndex: Purpose-built for RAG. Cleaner abstractions for document processing and retrieval. Best choice for RAG-focused applications.
- Custom code: For production systems where you need full control. Use libraries directly (OpenAI SDK, vector database client) without an orchestration layer.
Vector Databases
- Under 1M documents: pgvector or Chroma
- 1M-10M documents: Qdrant or Weaviate
- 10M+ documents: Pinecone or Milvus
- Need hybrid search: Weaviate or Qdrant
Document Processing
- Unstructured: The most comprehensive document processing library. Handles PDFs, Word docs, HTML, images, and more.
- LlamaParse: Document parsing service from the LlamaIndex team. Strong PDF parsing with table extraction.
- PyMuPDF: Fast, lightweight PDF extraction. Good when you need speed over accuracy on clean PDFs.
Monitoring
- LangSmith: Best tracing and debugging for LangChain-based systems. Shows every step of the RAG pipeline.
- Phoenix (Arize): Open-source LLM observability. Good for tracking retrieval quality and generation metrics over time.
- Custom logging: At minimum, log every query, the retrieved chunks, and the generated response. This enables debugging and evaluation.
The Five Most Common RAG Failures
1. Wrong Chunk Size
Chunks too large: retrieved context contains mostly irrelevant text, confusing the generator. Chunks too small: relevant information is split across chunks, and only one piece gets retrieved.
Fix: Experiment with 256, 512, and 1024 token chunks on your evaluation set. Measure retrieval precision and answer quality at each size. Most systems land between 256 and 512 tokens.
2. No Hybrid Search
Pure semantic search misses exact keyword matches. A query for "What is policy HR-2024-03?" won't find the document if semantic similarity between the query and the policy content is low. Adding BM25 keyword search catches these cases.
Fix: Implement hybrid search with a 0.7/0.3 semantic/keyword weight. Adjust weights based on your query distribution.
3. Poor Metadata Filtering
When your corpus spans multiple document types, time periods, or domains, retrieving across the entire corpus produces irrelevant results. A question about 2025 policies shouldn't return 2020 documents.
Fix: Attach metadata (date, document type, department, version) to every chunk. Use metadata filters in retrieval queries to narrow the search space.
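The fix amounts to an equality filter over chunk metadata. Real vector databases apply these filters server-side, but the semantics are the same as this illustrative sketch:

```python
def filter_chunks(chunks, **required):
    """Keep only chunks whose metadata matches every required field.

    Each chunk is a (text, metadata_dict) pair.
    """
    return [
        (text, meta) for text, meta in chunks
        if all(meta.get(k) == v for k, v in required.items())
    ]

corpus = [
    ("2025 leave policy ...", {"year": 2025, "dept": "HR"}),
    ("2020 leave policy ...", {"year": 2020, "dept": "HR"}),
]
# A question about current policy should never see the 2020 document.
current = filter_chunks(corpus, year=2025, dept="HR")
```

Filtering before similarity search also shrinks the search space, which improves both latency and precision.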
4. No Retrieval Failure Handling
When the vector database returns no relevant chunks (low similarity scores), most RAG systems pass empty or irrelevant context to the generator, which then hallucinates an answer.
Fix: Set a similarity threshold. If no chunks exceed the threshold, return a clear "I don't have enough information to answer this question" response instead of generating from empty context.
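A minimal sketch of the threshold-and-fallback logic, with an illustrative threshold and a stand-in `generate` call:

```python
FALLBACK = "I don't have enough information to answer this question."

def answer_or_fallback(results, generate, threshold=0.75):
    """Refuse to generate when no retrieved chunk clears the threshold.

    results: list of (chunk_text, similarity_score) pairs.
    generate(context) -> str stands in for the LLM call.
    """
    relevant = [text for text, score in results if score >= threshold]
    if not relevant:
        return FALLBACK  # never generate from empty or irrelevant context
    return generate("\n\n".join(relevant))
```

Tune the threshold on your evaluation set: too high and the system refuses answerable questions, too low and hallucinations creep back in.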
5. No Evaluation Pipeline
Without measurement, you don't know if changes improve or degrade quality. Teams make changes based on intuition, break things they don't notice, and ship poor-quality systems.
Fix: Build an evaluation set of 50-100 queries with known good answers before you optimize anything. Run it after every change. Automate it.
Getting Started Checklist
- Define your use case and success metrics (what questions will users ask? what quality is acceptable?)
- Collect and process your document corpus
- Choose chunking strategy (start with 512 tokens, 100 overlap)
- Select embedding model and vector database (start simple: text-embedding-3-small + pgvector or Chroma)
- Build basic retrieval pipeline (semantic search, top-5)
- Build generation pipeline (GPT-4o-mini with context-grounded prompt)
- Create evaluation set (50+ queries with expected answers)
- Measure baseline performance
- Iterate: add hybrid search, tune chunk size, add reranking, improve prompts
- Deploy with monitoring (log every query, retrieval, and response)
Cost of a Production RAG System
Minimal RAG (small corpus, low traffic):
- Vector database: $0-$65/month
- Embeddings: $5-$50/month
- LLM generation: $50-$500/month
- Total: $55-$615/month
Mid-scale RAG:
- Vector database: $65-$500/month
- Embeddings: $50-$200/month
- LLM generation: $500-$5,000/month
- Reranker: $100-$500/month
- Monitoring: $50-$200/month
- Total: $765-$6,400/month
Large-scale RAG:
- Vector database: $500-$5,000/month
- Embeddings: $200-$2,000/month
- LLM generation: $5,000-$50,000/month
- Reranker: $500-$2,000/month
- Monitoring: $200-$1,000/month
- Infrastructure: $1,000-$10,000/month
- Total: $7,400-$70,000/month