The original RAG paper by Lewis et al. (2020) laid the foundation for retrieval-augmented generation architectures. RAG is the most common production LLM pattern for a reason: it addresses the two biggest problems with language models. Hallucination drops because the model generates answers from retrieved documents rather than memorized training data. Knowledge stays current because you update the document corpus, not the model weights.
But most RAG systems underperform. Retrieval returns the wrong documents. Chunking splits context in the wrong places. Generation ignores retrieved content or hallucinates anyway. The gap between a tutorial RAG demo and a production RAG system is substantial.
This guide covers the full architecture, the tools that work, and the pitfalls that cause most systems to fail.
RAG Architecture Overview
A production RAG system has five components. Each one has architectural decisions that affect the whole system's quality.
1. Document Processing
Raw documents (PDFs, web pages, databases, emails, Slack messages) need to be converted into clean text chunks that a retrieval system can search.
Document parsing: Extract text from source formats while preserving structure.
- PDFs: Unstructured, PyMuPDF, or Amazon Textract for complex layouts
- Web pages: Beautiful Soup, Scrapy, or Firecrawl for full site crawling
- Office documents: python-docx, openpyxl, python-pptx
- Databases: SQL queries with text field extraction
- Multi-modal: Document AI (Google), Azure Form Recognizer for documents with tables and images
2. Chunking
Chunking is how you split documents into pieces that the retrieval system searches. This decision has outsized impact on overall RAG quality. Bad chunking is the single most common cause of poor RAG performance.
Fixed-size chunking: Split text every N tokens with M token overlap.
- Simple and predictable
- Good default: 512 tokens with 50-100 token overlap
- Problem: splits can land mid-paragraph or mid-sentence, breaking context

Semantic chunking: Split at natural boundaries (sentences, paragraphs, topic shifts).
- Preserves meaning better than fixed-size
- Requires more processing (sentence boundary detection, topic modeling)
- Can produce variable-size chunks (some too small, some too large)

Hierarchical chunking: Store chunks at multiple granularities (e.g., section, paragraph, sentence).
- Enables retrieval at the right granularity for each query
- More complex to implement and store
- Best for documents with clear hierarchical structure

Chunk size tradeoffs:
- Smaller chunks (128-256 tokens): Better precision (retrieved chunk is highly relevant), but may miss surrounding context
- Larger chunks (512-1024 tokens): More context per retrieval, but lower precision (may include irrelevant text)
- Very large chunks (1024+ tokens): Risk retrieving mostly irrelevant text with a small relevant section
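The fixed-size strategy above can be sketched in a few lines. Token counting is approximated here with a pre-tokenized list; a real pipeline would count tokens with the embedding model's own tokenizer (e.g. tiktoken), and the function name is illustrative:

```python
def chunk_fixed(tokens, chunk_size=512, overlap=100):
    """Split a token list into fixed-size chunks with overlap.

    Overlap means each chunk repeats the tail of the previous one,
    so a sentence cut at a boundary still appears whole somewhere.
    """
    chunks = []
    step = chunk_size - overlap  # how far the window advances each time
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already covered the tail
    return chunks
```

With the 512/100 default, a 1,000-token document yields three chunks whose boundaries overlap by 100 tokens.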
3. Embedding Generation
Convert text chunks into numerical vectors that capture semantic meaning. Similar texts produce similar vectors, enabling semantic search.
Model selection:
- OpenAI text-embedding-3-small (1536 dim): Good default, $0.02/1M tokens
- OpenAI text-embedding-3-large (3072 dim): Higher quality, 2x cost
- Cohere embed-v3 (1024 dim): Best multilingual performance
- BGE-M3 (open-source, 1024 dim): Best free option, runs locally
- Nomic embed-text-v1.5 (768 dim): Efficient open-source option
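Whichever model you pick, semantic search ultimately compares these vectors, most commonly with cosine similarity. A minimal illustration with toy 3-dimensional vectors (real models produce 768-3072 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": the query points mostly along the first axis.
query = [0.9, 0.1, 0.0]
doc_close = [0.8, 0.2, 0.1]  # semantically similar document
doc_far = [0.0, 0.1, 0.9]    # unrelated document
assert cosine_similarity(query, doc_close) > cosine_similarity(query, doc_far)
```

Vector databases implement exactly this comparison, but with approximate nearest-neighbor indexes so it scales to millions of chunks.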
4. Retrieval
Given a user query, find the most relevant chunks from your corpus. This is where most RAG systems succeed or fail.
Semantic search: Convert the query to a vector, search for the nearest vectors in your database. Finds conceptually similar content even when exact keywords don't match.

Keyword search (BM25): Traditional text search. Finds exact keyword matches. Important for queries with specific terms, names, identifiers, or codes that semantic search may not capture.

Hybrid search: Combine semantic and keyword search results. This is the recommended approach for production systems. Hybrid search improves recall by 20-30% over semantic search alone. Implementation: run both searches, normalize scores to the same range, and combine with a weighted average (typically 0.7 semantic + 0.3 keyword, but tune for your data).
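The normalize-and-combine step can be sketched as follows. Min-max normalization is one common choice here, and the function names are illustrative rather than from any particular library:

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1] so the two lists are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_merge(semantic, keyword, w_semantic=0.7):
    """Weighted fusion of semantic and BM25 scores (the 0.7/0.3 default)."""
    sem, kw = min_max(semantic), min_max(keyword)
    docs = set(sem) | set(kw)
    fused = {d: w_semantic * sem.get(d, 0.0) + (1 - w_semantic) * kw.get(d, 0.0)
             for d in docs}
    # Return doc ids ranked by fused score, best first.
    return sorted(fused, key=fused.get, reverse=True)
```

For example, a document that scores well on both signals outranks one that wins only on semantic similarity: `hybrid_merge({"A": 0.91, "B": 0.88, "C": 0.40}, {"A": 2.1, "B": 7.5, "C": 6.0})` ranks B first.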
Reranking: After initial retrieval, use a cross-encoder model to re-score results. Cross-encoders are more accurate than embedding similarity but too slow for initial retrieval on large corpora. Use them as a second stage on the top 20-50 results. Reranking models: Cohere Rerank, bge-reranker, ColBERT.
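The two-stage pattern looks roughly like this; `first_stage` and `cross_encoder_score` are hypothetical stand-ins for your retriever and for a reranker such as Cohere Rerank or bge-reranker:

```python
def two_stage_retrieve(query, first_stage, cross_encoder_score, final_k=5):
    """Cheap first-stage retrieval of ~50 candidates, then cross-encoder reranking.

    first_stage(query, k) -> list of candidate chunks.
    cross_encoder_score(query, chunk) -> relevance score (higher = better).
    """
    candidates = first_stage(query, k=50)
    # Re-score only the shortlist: cross-encoders are accurate but slow.
    reranked = sorted(candidates,
                      key=lambda chunk: cross_encoder_score(query, chunk),
                      reverse=True)
    return reranked[:final_k]
```

The key design point is that the expensive model never sees the full corpus, only the shortlist the cheap retriever produced.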
Retrieval parameters:
- Top-K: Number of chunks to retrieve. Start with 5-10. More isn't always better. Irrelevant chunks dilute the context and can confuse the generator.
- Similarity threshold: Minimum similarity score to include a result. Prevents low-relevance chunks from entering the context.
- Maximum context length: Total tokens sent to the generator. Balance between providing enough context and staying within the model's effective context window.
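Applied together, the three parameters above amount to a filter over the ranked results. A sketch with illustrative defaults (the function name and tuple layout are assumptions, not a library API):

```python
def select_context(results, top_k=5, min_score=0.75, max_tokens=3000):
    """Apply Top-K, a similarity threshold, and a context budget.

    results: list of (chunk_text, score, token_count) tuples,
    sorted by score descending.
    """
    selected, budget = [], max_tokens
    for text, score, tokens in results[:top_k]:   # Top-K cutoff
        if score < min_score:
            continue                              # similarity threshold
        if tokens > budget:
            continue                              # context length budget
        selected.append(text)
        budget -= tokens
    return selected
```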
5. Generation
The LLM generates an answer based on the retrieved context and the user's query.
Prompt structure:

```
System: You are a helpful assistant. Answer questions based on the provided context. If the context doesn't contain enough information to answer the question, say so.

Context:
[Retrieved chunk 1]
[Retrieved chunk 2]
[Retrieved chunk 3]

User: [Query]
```
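Assembling that template programmatically might look like the following; the message format assumes an OpenAI-style chat API, and the function name is illustrative:

```python
def build_prompt(query, chunks):
    """Build a grounded chat prompt from retrieved chunks."""
    system = (
        "You are a helpful assistant. Answer questions based on the provided "
        "context. If the context doesn't contain enough information to answer "
        "the question, say so."
    )
    # Number the chunks so the model can cite them as [1], [2], ...
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return [
        {"role": "system", "content": f"{system}\n\nContext:\n{context}"},
        {"role": "user", "content": query},
    ]
```

Numbering the chunks also makes citation extraction trivial later: the model can reference `[2]` and you can map that back to a source document.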
Key prompt design decisions:
- Always instruct the model to use only the provided context (reduces hallucination)
- Tell the model what to do when context is insufficient (say "I don't know" rather than guess)
- Request citations or source references if needed
- Specify output format (paragraph, bullet points, structured data)
Generator model selection:
- GPT-4o: Highest quality, most expensive ($2.50/$10.00 per 1M input/output tokens)
- Claude 3.5 Sonnet: Strong quality, good at following instructions ($3.00/$15.00)
- GPT-4o-mini: Good quality at 10x lower cost ($0.15/$0.60)
- Open-source 7B models: Lowest cost per query when self-hosted, adequate for many tasks
Evaluation Framework
Without systematic evaluation, you're guessing at RAG quality. Build an evaluation pipeline before optimizing anything else.
Retrieval Evaluation
Retrieval precision: Of the chunks retrieved, how many are relevant? Measure by having domain experts label query-chunk pairs.

Retrieval recall: Of all relevant chunks in the corpus, how many were retrieved? Requires a ground truth set of relevant documents for each query.

Mean Reciprocal Rank (MRR): Where does the first relevant chunk appear in the ranked results? Higher is better.

Generation Evaluation
Faithfulness: Does the answer stick to information in the retrieved context, or does it hallucinate? The most critical metric for production RAG.

Answer relevance: Does the answer address the question? A faithful answer that's off-topic is still useless.

Completeness: Does the answer cover all aspects of the question using the available context?

Automated Evaluation Tools
Ragas: The most popular RAG evaluation framework. Provides automated metrics for faithfulness, relevance, and context precision. Good for continuous monitoring but should be supplemented with human evaluation.

DeepEval: General LLM evaluation framework with RAG-specific metrics.

Custom evaluation: Build your own evaluation set of 50-100 questions with ground truth answers. Run it after every change to the RAG pipeline. This is the most reliable approach but requires upfront investment.

Evaluation Cadence
- After every architectural change (new embedding model, chunking strategy, retrieval method): Full evaluation run
- Weekly: Automated metrics on a sample of production queries
- Monthly: Human evaluation on 50-100 queries covering the full range of topics
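A custom evaluation harness can start very small. The sketch below scores answers by keyword presence, which is a crude stand-in for graded relevance but catches regressions cheaply; all names here are illustrative:

```python
def evaluate(rag_answer, eval_set):
    """Run a RAG pipeline over an evaluation set and report pass rate.

    rag_answer(question) -> str  is your pipeline, passed in as a callable.
    eval_set is a list of (question, expected_keywords) pairs; a question
    passes if every expected keyword appears in the answer.
    """
    passed = 0
    for question, keywords in eval_set:
        answer = rag_answer(question).lower()
        if all(k.lower() in answer for k in keywords):
            passed += 1
    return passed / len(eval_set)
```

Run this after every pipeline change and track the score over time; a drop tells you exactly which change hurt quality.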
Production Architecture
Basic RAG (MVP)
User query > Embed query > Vector search > Top-K chunks > LLM generation > Response
Components: One embedding model, one vector database, one LLM. This handles 80% of RAG use cases adequately.
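The MVP pipeline fits in one function once the components are abstracted; `embed`, `search`, and `generate` below are hypothetical stand-ins for your embedding model, vector database client, and LLM call:

```python
def answer(query, embed, search, generate, top_k=5):
    """Minimal RAG loop: embed query -> vector search -> grounded generation."""
    query_vec = embed(query)                 # query embedding
    chunks = search(query_vec, top_k=top_k)  # nearest chunks from the vector DB
    context = "\n\n".join(chunks)
    prompt = f"Answer from this context only:\n{context}\n\nQuestion: {query}"
    return generate(prompt)                  # LLM call
```

Everything in the Advanced and Agentic variants below is an elaboration of one of these four steps, which is why getting the basic loop instrumented and evaluated first pays off.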
Advanced RAG
User query > Query analysis (intent, entities) > Hybrid retrieval (semantic + keyword) > Reranking > Context compression > LLM generation with citation extraction > Response with sources
Additional components: Query classifier, BM25 index, reranker model, context compression module, citation extractor.
Agentic RAG
User query > Planning (decompose into sub-queries) > Multi-step retrieval (different sources per sub-query) > Synthesis (combine sub-answers) > Verification (check consistency) > Response
This pattern handles complex queries that require information from multiple sources. It's slower and more expensive but handles questions that basic RAG can't.
Common Tools and Stack Recommendations
Orchestration
- LangChain: Most popular, extensive ecosystem, good for prototyping. Can be verbose for simple use cases.
- LlamaIndex: Purpose-built for RAG. Cleaner abstractions for document processing and retrieval. Best choice for RAG-focused applications.
- Custom code: For production systems where you need full control. Use libraries directly (OpenAI SDK, vector database client) without an orchestration layer.
Vector Databases
- Under 1M documents: pgvector or Chroma
- 1M-10M documents: Qdrant or Weaviate
- 10M+ documents: Pinecone or Milvus
- Need hybrid search: Weaviate or Qdrant
Document Processing
- Unstructured: The most comprehensive document processing library. Handles PDFs, Word docs, HTML, images, and more.
- LlamaParse: Document parsing service from the LlamaIndex team. Strong PDF parsing with table extraction.
- PyMuPDF: Fast, lightweight PDF extraction. Good when you need speed over accuracy on clean PDFs.
Monitoring
- LangSmith: Best tracing and debugging for LangChain-based systems. Shows every step of the RAG pipeline.
- Phoenix (Arize): Open-source LLM observability. Good for tracking retrieval quality and generation metrics over time.
- Custom logging: At minimum, log every query, the retrieved chunks, and the generated response. This enables debugging and evaluation.
The Five Most Common RAG Failures
1. Wrong Chunk Size
Chunks too large: retrieved context contains mostly irrelevant text, confusing the generator. Chunks too small: relevant information is split across chunks, and only one piece gets retrieved.
Fix: Experiment with 256, 512, and 1024 token chunks on your evaluation set. Measure retrieval precision and answer quality at each size. Most systems land between 256 and 512 tokens.
2. No Hybrid Search
Pure semantic search misses exact keyword matches. A query for "What is policy HR-2024-03?" won't find the document if semantic similarity between the query and the policy content is low. Adding BM25 keyword search catches these cases.
Fix: Implement hybrid search with a 0.7/0.3 semantic/keyword weight. Adjust weights based on your query distribution.
3. Poor Metadata Filtering
When your corpus spans multiple document types, time periods, or domains, retrieving across the entire corpus produces irrelevant results. A question about 2025 policies shouldn't return 2020 documents.
Fix: Attach metadata (date, document type, department, version) to every chunk. Use metadata filters in retrieval queries to narrow the search space.
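The fix amounts to an equality filter over chunk metadata. Real vector databases apply these filters server-side, but the semantics are the same as this illustrative sketch:

```python
def filter_chunks(chunks, **required):
    """Keep only chunks whose metadata matches every required field.

    Each chunk is a (text, metadata_dict) pair.
    """
    return [
        (text, meta) for text, meta in chunks
        if all(meta.get(k) == v for k, v in required.items())
    ]

corpus = [
    ("2025 leave policy ...", {"year": 2025, "dept": "HR"}),
    ("2020 leave policy ...", {"year": 2020, "dept": "HR"}),
]
# A question about current policy should never see the 2020 document.
current = filter_chunks(corpus, year=2025, dept="HR")
```

Filtering before similarity search also shrinks the search space, which improves both latency and precision.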
4. No Retrieval Failure Handling
When the vector database returns no relevant chunks (low similarity scores), most RAG systems pass empty or irrelevant context to the generator, which then hallucinates an answer.
Fix: Set a similarity threshold. If no chunks exceed the threshold, return a clear "I don't have enough information to answer this question" response instead of generating from empty context.
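A minimal sketch of the threshold-and-fallback logic, with an illustrative threshold and a stand-in `generate` call:

```python
FALLBACK = "I don't have enough information to answer this question."

def answer_or_fallback(results, generate, threshold=0.75):
    """Refuse to generate when no retrieved chunk clears the threshold.

    results: list of (chunk_text, similarity_score) pairs.
    generate(context) -> str stands in for the LLM call.
    """
    relevant = [text for text, score in results if score >= threshold]
    if not relevant:
        return FALLBACK  # never generate from empty or irrelevant context
    return generate("\n\n".join(relevant))
```

Tune the threshold on your evaluation set: too high and the system refuses answerable questions, too low and hallucinations creep back in.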
5. No Evaluation Pipeline
Without measurement, you don't know if changes improve or degrade quality. Teams make changes based on intuition, break things they don't notice, and ship poor-quality systems.
Fix: Build an evaluation set of 50-100 queries with known good answers before you optimize anything. Run it after every change. Automate it.
Getting Started Checklist
- Define your use case and success metrics (what questions will users ask? what quality is acceptable?)
- Collect and process your document corpus
- Choose chunking strategy (start with 512 tokens, 100 overlap)
- Select embedding model and vector database (start simple: text-embedding-3-small + pgvector or Chroma)
- Build basic retrieval pipeline (semantic search, top-5)
- Build generation pipeline (GPT-4o-mini with context-grounded prompt)
- Create evaluation set (50+ queries with expected answers)
- Measure baseline performance
- Iterate: add hybrid search, tune chunk size, add reranking, improve prompts
- Deploy with monitoring (log every query, retrieval, and response)
Cost of a Production RAG System
Minimal RAG (small corpus, low traffic):
- Vector database: $0-$65/month
- Embeddings: $5-$50/month
- LLM generation: $50-$500/month
- Total: $55-$615/month
Mid-scale RAG:
- Vector database: $65-$500/month
- Embeddings: $50-$200/month
- LLM generation: $500-$5,000/month
- Reranker: $100-$500/month
- Monitoring: $50-$200/month
- Total: $765-$6,400/month
Large-scale RAG:
- Vector database: $500-$5,000/month
- Embeddings: $200-$2,000/month
- LLM generation: $5,000-$50,000/month
- Reranker: $500-$2,000/month
- Monitoring: $200-$1,000/month
- Infrastructure: $1,000-$10,000/month
- Total: $7,400-$70,000/month