RAG. The original RAG paper by Lewis et al. (2020) laid the foundation for retrieval-augmented generation architectures. RAG is the most common production LLM pattern for a reason: it solves the two biggest problems with language models. Hallucination drops because the model generates answers from retrieved documents, not memorized training data. Knowledge stays current because you update the document corpus, not the model weights.

But most RAG systems underperform. Retrieval returns the wrong documents. Chunking splits context in the wrong places. Generation ignores retrieved content or hallucinates anyway. The gap between a tutorial RAG demo and a production RAG system is substantial.

This guide covers the full architecture, the tools that work, and the pitfalls that cause most systems to fail.

RAG Architecture Overview


A production RAG system has five components. Each one has architectural decisions that affect the whole system's quality.

1. Document Processing

Raw documents (PDFs, web pages, databases, emails, Slack messages) need to be converted into clean text chunks that a retrieval system can search.

Document parsing: Extract text from source formats while preserving structure.
  • PDFs: Unstructured, PyMuPDF, or Amazon Textract for complex layouts
  • Web pages: Beautiful Soup, Scrapy, or Firecrawl for full site crawling
  • Office documents: python-docx, openpyxl, python-pptx
  • Databases: SQL queries with text field extraction
  • Multi-modal: Document AI (Google), Azure Form Recognizer for documents with tables and images
Text cleaning: Remove headers, footers, boilerplate, navigation text, and formatting artifacts. Normalize whitespace, fix encoding issues, and handle special characters. This unglamorous step prevents significant retrieval errors downstream.

Metadata extraction: Title, author, date, section hierarchy, document type, and any domain-specific metadata. Good metadata enables filtered retrieval (searching only within a specific document type, date range, or category).
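The cleaning step can be sketched in a few lines. This is a minimal, illustrative pass (the function name and the specific regexes are assumptions, not from any particular library); production pipelines layer many more format-specific rules on top.

```python
import re

def clean_text(raw: str) -> str:
    """Minimal cleaning pass: fix extraction artifacts and normalize whitespace."""
    text = raw.replace("\u00ad", "")        # drop soft hyphens from PDF extraction
    text = re.sub(r"-\n(\w)", r"\1", text)  # rejoin words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse blank-line runs
    return text.strip()

cleaned = clean_text("Intro-\nduction  to   RAG\n\n\n\nNext  section")
# cleaned == "Introduction to RAG\n\nNext section"
```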

2. Chunking

Chunking is how you split documents into pieces that the retrieval system searches. This decision has outsized impact on overall RAG quality. Bad chunking is the single most common cause of poor RAG performance.

Fixed-size chunking: Split text every N tokens with M token overlap.
  • Simple and predictable
  • Good default: 512 tokens with 50-100 token overlap
  • Problem: splits can land mid-paragraph or mid-sentence, breaking context
Semantic chunking: Split at natural boundaries (paragraphs, sections, topic changes).
  • Preserves meaning better than fixed-size
  • Requires more processing (sentence boundary detection, topic modeling)
  • Can produce variable-size chunks (some too small, some too large)
Hierarchical chunking: Create chunks at multiple levels (document, section, paragraph, sentence) and link them.
  • Enables retrieval at the right granularity for each query
  • More complex to implement and store
  • Best for documents with clear hierarchical structure
Recommended approach: Start with fixed-size chunking at 512 tokens with 100-token overlap. This works well for 80% of use cases. Switch to semantic chunking if you see retrieval quality issues where relevant content is split across chunks.

Chunk size tradeoffs:
  • Smaller chunks (128-256 tokens): Better precision (retrieved chunk is highly relevant), but may miss surrounding context
  • Larger chunks (512-1024 tokens): More context per retrieval, but lower precision (may include irrelevant text)
  • Very large chunks (1024+ tokens): Risk retrieving mostly irrelevant text with a small relevant section
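The fixed-size strategy above can be sketched directly over a token list. A minimal version, assuming tokenization has already happened (integers stand in for real token IDs):

```python
def chunk_tokens(tokens, size=512, overlap=100):
    """Split a token sequence into fixed-size chunks; neighbors share `overlap` tokens."""
    assert 0 <= overlap < size
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last chunk reached the end; avoid a tiny overlap-only tail
    return chunks

# Small sizes for readability.
chunks = chunk_tokens(list(range(10)), size=4, overlap=2)
# → [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which is the entire point of overlapping.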

3. Embedding Generation

Convert text chunks into numerical vectors that capture semantic meaning. Similar texts produce similar vectors, enabling semantic search.

Model selection:
  • OpenAI text-embedding-3-small (1536 dim): Good default, $0.02/1M tokens
  • OpenAI text-embedding-3-large (3072 dim): Higher quality, 2x cost
  • Cohere embed-v3 (1024 dim): Best multilingual performance
  • BGE-M3 (open-source, 1024 dim): Best free option, runs locally
  • Nomic embed-text-v1.5 (768 dim): Efficient open-source option
Batch processing: Always embed in batches (100-1,000 texts at once). Single-text embedding is 10-100x slower due to API overhead and lack of GPU batching.

Pre-computation: Embed all documents during ingestion. Only re-embed when document content changes. Store embeddings in your vector database alongside the text and metadata.
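A hedged sketch of the batching pattern, with a stub in place of the real embedding call (`embed_batch` would be an API or local-model call in practice; everything here is illustrative):

```python
def batched(items, batch_size=100):
    """Yield successive slices so each API call embeds many texts at once."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_corpus(texts, embed_batch, batch_size=100):
    """embed_batch is a stand-in for the real call (API or local model)."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors

# Stub embedder: a 2-dim "vector" of (char count, word count) per text.
vecs = embed_corpus(["a b", "ccc"], lambda b: [[len(t), len(t.split())] for t in b])
# → [[3, 2], [3, 1]]
```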

4. Retrieval

Given a user query, find the most relevant chunks from your corpus. This is where most RAG systems succeed or fail.

Semantic search: Convert the query to a vector, then search for the nearest vectors in your database. Finds conceptually similar content even when exact keywords don't match.

Keyword search (BM25): Traditional text search. Finds exact keyword matches. Important for queries with specific terms, names, identifiers, or codes that semantic search may not capture.

Hybrid search: Combine semantic and keyword search results. This is the recommended approach for production systems. Hybrid search improves recall by 20-30% over semantic search alone.

Implementation: Run both searches, normalize scores to the same range, and combine with a weighted average (typically 0.7 semantic + 0.3 keyword, but tune for your data).
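The normalize-and-combine step can be sketched as follows, assuming both searches scored the same candidate documents in the same order (min-max normalization is one common choice; reciprocal rank fusion is a popular alternative):

```python
def minmax(scores):
    """Rescale a score list to [0, 1]; a constant list maps to all 1.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(semantic, keyword, w_semantic=0.7):
    """Weighted average of normalized semantic and keyword scores, per document."""
    sem, kw = minmax(semantic), minmax(keyword)
    return [w_semantic * s + (1 - w_semantic) * k for s, k in zip(sem, kw)]

# Doc 0 wins on semantic, doc 2 on keyword; the fusion balances both signals.
fused = hybrid_scores([0.9, 0.5, 0.1], [1.0, 3.0, 5.0])  # ≈ [0.7, 0.5, 0.3]
```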

Reranking: After initial retrieval, use a cross-encoder model to re-score results. Cross-encoders are more accurate than embedding similarity but too slow for initial retrieval on large corpora. Use them as a second stage on the top 20-50 results.

Reranking models: Cohere Rerank, bge-reranker, ColBERT.

Retrieval parameters:
  • Top-K: Number of chunks to retrieve. Start with 5-10. More isn't always better. Irrelevant chunks dilute the context and can confuse the generator.
  • Similarity threshold: Minimum similarity score to include a result. Prevents low-relevance chunks from entering the context.
  • Maximum context length: Total tokens sent to the generator. Balance between providing enough context and staying within the model's effective context window.
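The three parameters combine naturally into one selection step. A sketch, assuming each retrieval result carries its score and a precomputed token count (all names and defaults here are illustrative):

```python
def select_context(results, top_k=5, min_score=0.3, max_tokens=3000):
    """Pick chunks for the prompt. results: list of (text, score, token_count)."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    picked, used = [], 0
    for text, score, n_tokens in ranked:
        if len(picked) >= top_k or score < min_score:
            break  # scores are descending, so nothing later can qualify
        if used + n_tokens > max_tokens:
            continue  # this chunk would overflow the budget; a smaller one may fit
        picked.append(text)
        used += n_tokens
    return picked

results = [("a", 0.9, 1000), ("b", 0.8, 2500), ("c", 0.7, 1500), ("d", 0.2, 100)]
# "b" is skipped (budget overflow), "d" is cut (below threshold) → ["a", "c"]
```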

5. Generation

The LLM generates an answer based on the retrieved context and the user's query.

Prompt structure:

  System: You are a helpful assistant. Answer questions based on the provided context. If the context doesn't contain enough information to answer the question, say so.

  Context: [Retrieved chunk 1] [Retrieved chunk 2] [Retrieved chunk 3]

  User: [Query]

Key prompt design decisions:
  • Always instruct the model to use only the provided context (reduces hallucination)
  • Tell the model what to do when context is insufficient (say "I don't know" rather than guess)
  • Request citations or source references if needed
  • Specify output format (paragraph, bullet points, structured data)
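Putting these decisions together, a minimal prompt builder might look like this (the chat-message dict format follows OpenAI-style APIs; the exact wording and numbering scheme are illustrative, not a fixed recipe):

```python
def build_rag_messages(query, chunks, max_chunks=5):
    """Assemble a grounded chat prompt from retrieved chunks and a user query."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks[:max_chunks]))
    system = (
        "You are a helpful assistant. Answer questions based only on the "
        "provided context. If the context doesn't contain enough information "
        "to answer the question, say so. Cite chunk numbers like [1]."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]

msgs = build_rag_messages("What is the refund policy?", ["Refunds within 30 days."])
```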
Model selection for generation:
  • GPT-4o: Highest quality, most expensive ($2.50/$10.00 per 1M input/output tokens)
  • Claude 3.5 Sonnet: Strong quality, good at following instructions ($3.00/$15.00)
  • GPT-4o-mini: Good quality at 10x lower cost ($0.15/$0.60)
  • Open-source 7B models: Lowest cost per query when self-hosted, adequate for many tasks
For most production RAG, GPT-4o-mini or a comparable model provides the best quality-per-dollar ratio.

Evaluation Framework

Without systematic evaluation, you're guessing at RAG quality. Build an evaluation pipeline before optimizing anything else.

Retrieval Evaluation

Retrieval precision: Of the chunks retrieved, how many are relevant? Measure by having domain experts label query-chunk pairs.

Retrieval recall: Of all relevant chunks in the corpus, how many were retrieved? Requires a ground truth set of relevant documents for each query.

Mean Reciprocal Rank (MRR): Where does the first relevant chunk appear in the ranked results? Higher is better.
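These three metrics are simple to compute once you have labeled query-chunk pairs. A sketch (chunk IDs are arbitrary strings here):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks that appear in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(queries):
    """Mean reciprocal rank of the first relevant chunk per query.
    queries: list of (ranked_chunk_ids, relevant_chunk_ids)."""
    total = 0.0
    for ranked, relevant in queries:
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)

p = precision_at_k(["a", "b", "c"], {"a", "c"}, 3)  # 2 of 3 retrieved are relevant
```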

Generation Evaluation

Faithfulness: Does the answer stick to information in the retrieved context, or does it hallucinate? The most critical metric for production RAG.

Answer relevance: Does the answer address the question? A faithful answer that's off-topic is still useless.

Completeness: Does the answer cover all aspects of the question using the available context?

Automated Evaluation Tools

Ragas: The most popular RAG evaluation framework. Provides automated metrics for faithfulness, relevance, and context precision. Good for continuous monitoring but should be supplemented with human evaluation.

DeepEval: General LLM evaluation framework with RAG-specific metrics.

Custom evaluation: Build your own evaluation set of 50-100 questions with ground truth answers. Run it after every change to the RAG pipeline. This is the most reliable approach but requires upfront investment.

Evaluation Cadence

  • After every architectural change (new embedding model, chunking strategy, retrieval method): Full evaluation run
  • Weekly: Automated metrics on a sample of production queries
  • Monthly: Human evaluation on 50-100 queries covering the full range of topics

Production Architecture

Basic RAG (MVP)

User query > Embed query > Vector search > Top-K chunks > LLM generation > Response

Components: One embedding model, one vector database, one LLM. This handles 80% of RAG use cases adequately.

Advanced RAG

User query > Query analysis (intent, entities) > Hybrid retrieval (semantic + keyword) > Reranking > Context compression > LLM generation with citation extraction > Response with sources

Additional components: Query classifier, BM25 index, reranker model, context compression module, citation extractor.

Agentic RAG

User query > Planning (decompose into sub-queries) > Multi-step retrieval (different sources per sub-query) > Synthesis (combine sub-answers) > Verification (check consistency) > Response

This pattern handles complex queries that require information from multiple sources. It's slower and more expensive but handles questions that basic RAG can't.

Common Tools and Stack Recommendations

Orchestration

  • LangChain: Most popular, extensive ecosystem, good for prototyping. Can be verbose for simple use cases.
  • LlamaIndex: Purpose-built for RAG. Cleaner abstractions for document processing and retrieval. Best choice for RAG-focused applications.
  • Custom code: For production systems where you need full control. Use libraries directly (OpenAI SDK, vector database client) without an orchestration layer.

Vector Databases

  • Under 1M documents: pgvector or Chroma
  • 1M-10M documents: Qdrant or Weaviate
  • 10M+ documents: Pinecone or Milvus
  • Need hybrid search: Weaviate or Qdrant
See our Vector Database Selection Guide for detailed comparison.

Document Processing

  • Unstructured: The most comprehensive document processing library. Handles PDFs, Word docs, HTML, images, and more.
  • LlamaParse: Document parsing service from the LlamaIndex team. Strong PDF parsing with table extraction.
  • PyMuPDF: Fast, lightweight PDF extraction. Good when you need speed over accuracy on clean PDFs.

Monitoring

  • LangSmith: Best tracing and debugging for LangChain-based systems. Shows every step of the RAG pipeline.
  • Phoenix (Arize): Open-source LLM observability. Good for tracking retrieval quality and generation metrics over time.
  • Custom logging: At minimum, log every query, the retrieved chunks, and the generated response. This enables debugging and evaluation.

The Five Most Common RAG Failures

1. Wrong Chunk Size

Chunks too large: retrieved context contains mostly irrelevant text, confusing the generator. Chunks too small: relevant information is split across chunks, and only one piece gets retrieved.

Fix: Experiment with 256, 512, and 1024 token chunks on your evaluation set. Measure retrieval precision and answer quality at each size. Most systems land between 256-512 tokens.

2. No Hybrid Search

Pure semantic search misses exact keyword matches. A query for "What is policy HR-2024-03?" won't find the document if semantic similarity between the query and the policy content is low. Adding BM25 keyword search catches these cases.

Fix: Implement hybrid search with a 0.7/0.3 semantic/keyword weight. Adjust weights based on your query distribution.

3. Poor Metadata Filtering

When your corpus spans multiple document types, time periods, or domains, retrieving across the entire corpus produces irrelevant results. A question about 2025 policies shouldn't return 2020 documents.

Fix: Attach metadata (date, document type, department, version) to every chunk. Use metadata filters in retrieval queries to narrow the search space.
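In a real deployment the filter runs inside the vector database (for example, a metadata filter in Qdrant or Weaviate, or a SQL WHERE clause with pgvector), but the logic is simple to sketch in memory:

```python
def filter_chunks(chunks, **filters):
    """Keep chunks whose metadata matches every supplied key=value filter."""
    return [
        c for c in chunks
        if all(c["metadata"].get(key) == value for key, value in filters.items())
    ]

corpus = [
    {"text": "2025 leave policy ...", "metadata": {"year": 2025, "type": "policy"}},
    {"text": "2020 leave policy ...", "metadata": {"year": 2020, "type": "policy"}},
    {"text": "2025 org chart ...",    "metadata": {"year": 2025, "type": "chart"}},
]
hits = filter_chunks(corpus, year=2025, type="policy")  # only the 2025 policy
```

Filtering first shrinks the search space, so a question about 2025 policies can never surface a 2020 document.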

4. No Retrieval Failure Handling

When the vector database returns no relevant chunks (low similarity scores), most RAG systems pass empty or irrelevant context to the generator, which then hallucinates an answer.

Fix: Set a similarity threshold. If no chunks exceed the threshold, return a clear "I don't have enough information to answer this question" response instead of generating from empty context.
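A sketch of this fallback logic, with stand-in `retrieve` and `generate` callables (the threshold value is illustrative and should be tuned on your data):

```python
FALLBACK = "I don't have enough information to answer this question."

def answer_or_fallback(query, retrieve, generate, min_score=0.35):
    """retrieve(query) -> list of (chunk, score); generate(query, chunks) -> str.
    Both are stand-ins for the real search and LLM calls."""
    chunks = [c for c, score in retrieve(query) if score >= min_score]
    if not chunks:
        return FALLBACK  # refuse rather than generate from weak context
    return generate(query, chunks)

# Stub retriever that finds nothing above the threshold:
reply = answer_or_fallback("q", lambda q: [("noise", 0.1)], lambda q, c: "answer")
# → reply == FALLBACK
```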

5. No Evaluation Pipeline

Without measurement, you don't know if changes improve or degrade quality. Teams make changes based on intuition, break things they don't notice, and ship poor-quality systems.

Fix: Build an evaluation set of 50-100 queries with known good answers before you optimize anything. Run it after every change. Automate it.

Getting Started Checklist

  1. Define your use case and success metrics (what questions will users ask? what quality is acceptable?)
  2. Collect and process your document corpus
  3. Choose chunking strategy (start with 512 tokens, 100 overlap)
  4. Select embedding model and vector database (start simple: text-embedding-3-small + pgvector or Chroma)
  5. Build basic retrieval pipeline (semantic search, top-5)
  6. Build generation pipeline (GPT-4o-mini with context-grounded prompt)
  7. Create evaluation set (50+ queries with expected answers)
  8. Measure baseline performance
  9. Iterate: add hybrid search, tune chunk size, add reranking, improve prompts
  10. Deploy with monitoring (log every query, retrieval, and response)
The entire basic pipeline can be built in 2-3 days. Getting it to production quality typically takes 2-4 weeks of iteration guided by evaluation metrics.

Cost of a Production RAG System

Minimal RAG (small corpus, low traffic):
  • Vector database: $0-$65/month
  • Embeddings: $5-$50/month
  • LLM generation: $50-$500/month
  • Total: $55-$615/month
Standard RAG (medium corpus, moderate traffic):
  • Vector database: $65-$500/month
  • Embeddings: $50-$200/month
  • LLM generation: $500-$5,000/month
  • Reranker: $100-$500/month
  • Monitoring: $50-$200/month
  • Total: $765-$6,400/month
Enterprise RAG (large corpus, high traffic, multiple domains):
  • Vector database: $500-$5,000/month
  • Embeddings: $200-$2,000/month
  • LLM generation: $5,000-$50,000/month
  • Reranker: $500-$2,000/month
  • Monitoring: $200-$1,000/month
  • Infrastructure: $1,000-$10,000/month
  • Total: $7,400-$70,000/month
The biggest cost lever is the LLM generation step. Model routing (cheaper models for simple queries) and caching (skip generation for repeated queries) produce the largest cost reductions.
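Exact-match caching, the simpler of the two levers, can be sketched as below (semantic caching, which also matches paraphrases, needs an embedding lookup on top of this; the class and its normalization are illustrative):

```python
import hashlib

class QueryCache:
    """Exact-match cache keyed on a normalized query hash; skips repeat generation."""
    def __init__(self):
        self._store = {}

    def _key(self, query):
        normalized = " ".join(query.lower().split())  # case- and spacing-insensitive
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_generate(self, query, generate):
        key = self._key(query)
        if key not in self._store:
            self._store[key] = generate(query)  # only pay for the first occurrence
        return self._store[key]

calls = []
def expensive(q):
    calls.append(q)
    return f"answer:{q}"

cache = QueryCache()
a1 = cache.get_or_generate("What is RAG?", expensive)
a2 = cache.get_or_generate("what  is rag?", expensive)  # normalizes to the same key
# a1 == a2, and `expensive` ran only once
```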

Frequently Asked Questions

What is RAG and why does it matter?
Retrieval-Augmented Generation (RAG) combines a retrieval system (searching a knowledge base for relevant documents) with a generative model (producing answers based on retrieved context). It matters because it solves two LLM problems: hallucination (the model cites real sources instead of making things up) and knowledge cutoff (the retrieval corpus can be updated without retraining the model).

What are the components of a production RAG system?
Five components: document processing (chunking, cleaning, metadata extraction), embedding generation (converting text to vectors), vector storage (database for similarity search), retrieval (finding relevant chunks for a query), and generation (producing an answer from retrieved context). Each component has multiple tool options and architectural decisions that affect overall quality.

What tools do I need to build a RAG system?
Minimum viable stack: LangChain or LlamaIndex for orchestration, OpenAI or Cohere for embeddings, Chroma or pgvector for vector storage, and GPT-4 or Claude for generation. Production stack adds: Unstructured for document processing, a dedicated vector database (Qdrant, Weaviate, or Pinecone), an evaluation framework (Ragas or custom), and observability (LangSmith or Phoenix).

How do I measure RAG quality?
Measure four dimensions: retrieval precision (are the retrieved chunks relevant?), retrieval recall (are all relevant chunks found?), answer faithfulness (does the answer stick to retrieved context?), and answer relevance (does the answer address the question?). Use frameworks like Ragas for automated evaluation. Always include human evaluation on a random sample. Track metrics over time to catch regression.

What are the most common RAG mistakes?
Five mistakes that cause most RAG failures: chunking too large or too small (experiment with 256-1024 token chunks), not using hybrid search (combining semantic and keyword retrieval improves recall 20-30%), skipping metadata filtering (retrieving from the wrong document section), not evaluating systematically (relying on vibes instead of metrics), and not handling retrieval failures (when no good chunks are found, the system should say so instead of hallucinating).

About the Author

Founder, AI Pulse

Rome Thorndike is the founder of AI Pulse, a career intelligence platform for AI professionals. He tracks the AI job market through analysis of thousands of active job postings, providing data-driven insights on salaries, skills, and hiring trends.

