Fine-tuning an LLM costs between $50 and $50,000 per run depending on model size and method. RAG costs $0.01-$0.10 per query with no upfront training investment. The decision between them isn't just about cost. It's about what kind of problem you're solving, how your data changes over time, and what quality bar you need to hit.
Most teams default to RAG because it's faster to implement. That's often the right call. But there are specific scenarios where fine-tuning produces meaningfully better results. Here's how to decide, and how to execute either approach well.
When to Fine-Tune
Fine-tuning is the right choice in a narrow set of high-value situations.
The Model Needs to Learn a Style or Format
Fine-tune when you need the model to consistently produce outputs in a specific format, tone, or structure that prompting alone can't achieve reliably. Examples:
- Medical documentation that follows specific clinical terminology conventions
- Legal briefs formatted according to court-specific requirements
- Code generation in a proprietary language or framework
- Customer communication that matches a brand's exact voice and terminology
You Have High-Quality, Consistent Training Data
Fine-tuning works best when you have 500+ carefully curated examples that demonstrate exactly what you want the model to do. The data should be:
- Consistent in quality (every example is correct and complete)
- Representative of the full range of inputs you'll see in production
- Free of contradictions (don't train the model to produce conflicting outputs)
- Vetted by domain experts (not auto-generated or scraped without review)
Latency Matters
RAG adds retrieval time to every query. A vector database lookup adds 10-50ms. Reranking adds another 20-100ms. For applications where total latency must be under 200ms, eliminating the retrieval step by fine-tuning knowledge directly into the model can be the right tradeoff.
This applies most to: real-time chat interfaces, voice assistants, inline code completion, and any application where users perceive delays.
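The latency figures above can be turned into a quick budget check. This is a sketch using the illustrative ranges from this section, not measured numbers; the 200ms budget is the example threshold named above.

```python
# Worst-case latency budgeting with the illustrative figures from this section.
RAG_OVERHEAD_MS = {"vector_lookup": (10, 50), "reranking": (20, 100)}

def rag_overhead_range():
    """Return (best, worst) milliseconds added by the RAG retrieval path."""
    best = sum(lo for lo, _ in RAG_OVERHEAD_MS.values())
    worst = sum(hi for _, hi in RAG_OVERHEAD_MS.values())
    return best, worst

def fits_budget(generation_ms, budget_ms=200, use_rag=True):
    """Does the request fit the latency budget in the worst case?"""
    _, worst = rag_overhead_range() if use_rag else (0, 0)
    return generation_ms + worst <= budget_ms

best, worst = rag_overhead_range()
print(f"RAG adds {best}-{worst} ms per query")        # 30-150 ms
print(fits_budget(generation_ms=120, use_rag=True))   # False: 120 + 150 > 200
print(fits_budget(generation_ms=120, use_rag=False))  # True
```

The point of the worst-case check: a model that generates in 120ms fits a 200ms budget on its own, but not once retrieval and reranking hit their upper bounds.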
You Need Domain Internalization
When the model needs to "think" in a domain's language rather than look up facts. A doctor doesn't consult a reference for every medical term. Similarly, a fine-tuned model can internalize domain vocabulary, relationships, and reasoning patterns in ways that RAG context injection can't replicate.
This works for: domain-specific reasoning, technical terminology usage, and tasks where the model needs to draw on specialized knowledge during multi-step reasoning.
When to Use RAG
RAG is the better choice for most production LLM applications.
Information Changes Frequently
If your knowledge base updates daily, weekly, or monthly, RAG wins. You update the document corpus without retraining. Fine-tuning requires a new training run every time the underlying information changes, which costs time and money.
Use RAG for: product catalogs, documentation that gets updated, news and current events, any source of truth that evolves.
You Need Citations
RAG naturally supports citations because the model generates answers based on specific retrieved documents. You can show users exactly which sources informed the answer. Fine-tuning bakes information into model weights, making it impossible to trace which training example influenced a specific output.
Use RAG for: any application where users need to verify answers, legal and medical applications, research tools, customer support where agents need to reference specific policies.
You Want to Avoid Hallucination on Factual Queries
RAG with proper retrieval reduces hallucination because the model generates answers grounded in retrieved documents rather than relying on memorized knowledge. Fine-tuning can reduce hallucination in the domain it was trained on but doesn't have the same grounding mechanism.
Your Budget Is Limited
A basic RAG system costs $100-$500 to set up (embedding generation, vector database, orchestration) and $0.01-$0.10 per query for ongoing costs. Fine-tuning a 7B model costs $50-$500 per run, and you'll run it multiple times as you iterate. For a 70B model, costs start at $5,000 per run.
If you're not sure whether fine-tuning will solve your problem, start with RAG. You can always fine-tune later if RAG doesn't meet your quality bar.
How to Fine-Tune Effectively
Step 1: Prepare Your Data
Data preparation takes 60-70% of the total fine-tuning effort. Don't rush this.
Format: JSON Lines (JSONL) with instruction/input/output triplets or a conversation format, depending on the model. Follow the base model's training format exactly.
Quality control:
- Review every example manually if you have fewer than 1,000
- Sample-review 10-20% if you have more
- Check for consistency: similar inputs should produce similar outputs
- Remove or fix contradictory examples
- Ensure coverage: your training data should represent the full distribution of production inputs
Dataset size guidelines by task type:
- Style/format adaptation: 100-500 examples
- Domain specialization: 1,000-10,000 examples
- Complex task learning: 10,000-100,000 examples
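Part of the quality-control pass above can be automated before any manual review. A minimal sketch, assuming the instruction/input/output triplet schema mentioned earlier (your field names may differ):

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}  # assumed triplet schema

def validate_jsonl(lines):
    """Return (valid_examples, errors) for JSONL training data."""
    valid, errors = [], []
    for n, line in enumerate(lines, start=1):
        try:
            ex = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {n}: not valid JSON")
            continue
        missing = REQUIRED_KEYS - ex.keys()
        if missing:
            errors.append(f"line {n}: missing {sorted(missing)}")
        elif not ex["output"].strip():
            errors.append(f"line {n}: empty output")
        else:
            valid.append(ex)
    return valid, errors

rows = [
    '{"instruction": "Summarize", "input": "long text", "output": "short text"}',
    '{"instruction": "Summarize", "input": "long text"}',  # missing output field
    'not json at all',
]
valid, errors = validate_jsonl(rows)
print(len(valid), len(errors))  # 1 2
```

Mechanical checks like these catch schema drift and truncated examples; consistency and correctness still need the expert review described above.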
Step 2: Choose Your Method
LoRA (Low-Rank Adaptation)
The default choice for most fine-tuning in 2026. LoRA trains small adapter layers instead of modifying all model weights.
Benefits:
- 60-80% reduction in GPU memory requirements
- 40-60% faster training time
- Can maintain multiple LoRA adapters and swap between them
- Minimal quality loss for most tasks
QLoRA (Quantized LoRA)
Combines LoRA with 4-bit quantization of the base model. Enables fine-tuning of 70B-parameter models on a single 48GB GPU.
Benefits:
- Can fine-tune much larger models on consumer hardware
- Additional cost savings over standard LoRA
- Quality is within 1-3% of full LoRA for most tasks
Full Fine-Tuning
Modifies all model weights. Rarely necessary and significantly more expensive.
Use full fine-tuning when: LoRA/QLoRA can't achieve the quality you need (rare), you're training on a very large dataset (100K+ examples), or you're fine-tuning a small model (under 3B parameters) where LoRA overhead isn't justified.
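The decision rules above can be condensed into a small heuristic. This is a sketch of the guidance in this section, not a hard rule; the thresholds mirror the numbers given above (3B, 100K examples, 48GB GPU):

```python
def choose_method(model_params_b, gpu_memory_gb, dataset_size, need_weights=True):
    """Heuristic fine-tuning method picker following this section's guidance."""
    if not need_weights:
        return "api"    # provider-hosted fine-tuning, no infrastructure to manage
    if model_params_b < 3 or dataset_size >= 100_000:
        return "full"   # small model or very large dataset: LoRA overhead not justified
    if model_params_b >= 30 and gpu_memory_gb <= 48:
        return "qlora"  # 4-bit base model fits a 70B model on a single 48GB GPU
    return "lora"       # the default choice for most fine-tuning

print(choose_method(7, 24, 5_000))    # lora
print(choose_method(70, 48, 5_000))   # qlora
print(choose_method(1, 24, 2_000))    # full
```

In practice, start from "lora" and only move off it when one of the other conditions clearly applies.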
API-Based Fine-Tuning
OpenAI, Google, and other providers offer fine-tuning through their APIs. You upload training data; they handle the training.
Benefits:
- No GPU infrastructure to manage
- Simple API interface
- Pay per token of training data
Tradeoffs:
- Limited control over training parameters
- Model weights aren't accessible (vendor lock-in)
- More expensive per run than self-hosted for large datasets
Step 3: Configure Training
Key hyperparameters:
- Learning rate: Start at 2e-5 for LoRA, 1e-5 for full fine-tuning. Too high causes forgetting; too low wastes compute.
- Epochs: 1-3 for LoRA, 1-2 for full fine-tuning. More epochs on small datasets lead to overfitting.
- LoRA rank (r): 8-32 for most tasks. Higher rank captures more complex patterns but uses more memory. Start at 16.
- LoRA alpha: Typically 2x the rank (r=16, alpha=32). Controls the scaling of LoRA updates.
- Batch size: As large as your GPU memory allows. Gradient accumulation can simulate larger batch sizes.
Step 4: Evaluate Rigorously
Task-specific metrics: Accuracy, F1, BLEU, ROUGE, or custom metrics depending on your task. Compare against the base model and against RAG on the same evaluation set.
General capability testing: Fine-tuning can cause catastrophic forgetting, where the model gets better at your task but worse at general tasks. Test on a general benchmark (MMLU, HellaSwag) before and after fine-tuning.
Human evaluation: Automated metrics don't capture everything. Have domain experts review 50-100 outputs from the fine-tuned model, scoring for accuracy, format compliance, and overall quality.
A/B testing in production: The final evaluation is real-user behavior. Deploy the fine-tuned model alongside the baseline and compare user satisfaction, task completion rates, and error rates.
Hybrid Architectures: Fine-Tuning + RAG
The best production systems often combine both approaches. Fine-tune for style, format, and base domain knowledge. Use RAG for specific facts, recent information, and citeable answers.
Pattern 1: Fine-Tuned Model + RAG Retrieval
Fine-tune a model to follow your output format and tone. At inference time, retrieve relevant context from a knowledge base and include it in the prompt. The model generates answers in the right style while grounding responses in retrieved documents.
This works well for: customer support systems (consistent tone + accurate product information), medical Q&A (clinical language + current treatment protocols), and legal research (proper citation format + relevant case law).
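Pattern 1 reduces, at inference time, to assembling a prompt that pairs the fine-tuned style with retrieved facts. A minimal sketch; the system string and document fields are hypothetical placeholders, and the style instruction assumes the model was fine-tuned on outputs in that format:

```python
def build_prompt(system_style, retrieved_docs, question):
    """Pattern 1: the fine-tuned model supplies style, retrieval supplies facts."""
    context = "\n\n".join(f"Source: {d['id']}\n{d['text']}" for d in retrieved_docs)
    return (
        f"{system_style}\n\n"
        f"Answer using only the sources below.\n\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "You are a support agent. Answer in our standard concise, friendly format.",
    [{"id": "kb-142", "text": "The Pro plan includes 5 seats."}],  # hypothetical doc
    "How many seats does Pro include?",
)
print("kb-142" in prompt)  # True: source IDs survive into the prompt for citation
```

Keeping source IDs in the prompt is what lets the generated answer cite specific documents, preserving the traceability advantage of RAG described earlier.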
Pattern 2: Fine-Tuned Retrieval + Base Model
Fine-tune a smaller model specifically for retrieval: given a query, identify the most relevant documents. Use a general-purpose LLM for generation with the retrieved context.
This works well when: your retrieval needs are domain-specific but your generation needs are general-purpose.
Pattern 3: Fine-Tuned Router + Specialized Models
Train a small model to route queries to the appropriate handler: RAG for factual questions, fine-tuned model for style-specific generation, base model for general queries, and deterministic code for structured tasks.
This is the most sophisticated pattern but also the most effective for complex production systems with diverse query types.
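The router's decision logic can be sketched without the trained model itself. Here keyword matching stands in for the small router model described above; the keywords and handler names are illustrative assumptions:

```python
def route(query):
    """Pattern 3 sketch: keyword rules stand in for the trained router model."""
    q = query.lower()
    if q.startswith(("create", "delete", "export")):
        return "deterministic"  # structured tasks -> plain code, no LLM
    if any(w in q for w in ("price", "spec", "policy", "when", "how many")):
        return "rag"            # factual questions -> retrieval-grounded answer
    if "draft" in q or "write" in q:
        return "finetuned"      # style-specific generation
    return "base"               # everything else -> general-purpose model

print(route("How many seats does the Pro plan include?"))  # rag
print(route("Draft a renewal reminder email"))             # finetuned
print(route("export my invoices"))                         # deterministic
```

In production the rules would be replaced by a small classifier fine-tuned on labeled queries, but the interface (query in, handler name out) stays the same.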
Cost Comparison
Fine-Tuning Costs (One-Time per Training Run)
- LoRA on 7B model (cloud GPU): $50-$500
- QLoRA on 70B model (single GPU): $500-$5,000
- Full fine-tuning on 7B model: $500-$5,000
- Full fine-tuning on 70B model: $5,000-$50,000+
- API fine-tuning (OpenAI, moderate dataset): $100-$2,000
RAG Costs (Ongoing per Query)
- Embedding generation: $0.0001-$0.001 per query
- Vector database query: $0.0001-$0.001 per query
- LLM generation with context: $0.01-$0.10 per query
- Total per query: $0.01-$0.10
Break-Even Analysis
A LoRA fine-tuning run costs ~$200. If fine-tuning eliminates the need for RAG retrieval and reduces prompt length by 1,000 tokens per query ($0.01-$0.03 savings per query at typical API rates), the break-even is 7,000-20,000 queries. For a system handling 10,000 queries per day, fine-tuning pays for itself in 1-2 days.
For systems with low query volume (under 1,000 queries/day), RAG is almost always more cost-effective. For high-volume systems, the math favors fine-tuning, especially when combined with self-hosted inference.
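The break-even arithmetic above is simple enough to encode directly, using the ~$200 run cost and the $0.01-$0.03 per-query savings range from this section:

```python
def breakeven_queries(training_cost, savings_per_query):
    """Queries needed before a fine-tuning run pays for itself."""
    return training_cost / savings_per_query

run_cost = 200.0          # approximate LoRA run cost from the figures above
low, high = 0.01, 0.03    # per-query savings from ~1,000 fewer prompt tokens

print(breakeven_queries(run_cost, high))  # ~6,667 queries (best case)
print(breakeven_queries(run_cost, low))   # 20,000 queries (worst case)
# At 10,000 queries/day, the run pays for itself in 1-2 days.
```

Plug in your own run cost and token savings; the crossover shifts quickly with query volume, which is why low-volume systems rarely justify fine-tuning on cost alone.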
Common Mistakes
Fine-Tuning on Bad Data
The most common mistake. Garbage in, garbage out. If your training examples contain errors, inconsistencies, or don't represent your production distribution, the fine-tuned model will reflect those problems. Spend twice as long on data preparation as you think you need.
Over-Fitting on Small Datasets
Training too many epochs on a small dataset causes the model to memorize training examples rather than learning patterns. Symptoms: perfect performance on training data, poor performance on held-out examples, and repetitive outputs. Solution: fewer epochs, more data, or LoRA with lower rank.
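The overfitting symptoms above show up in the loss curves before they show up in outputs. A minimal sketch of the check; the 0.3 gap threshold is an illustrative assumption, not a standard value:

```python
def overfit_signal(train_loss, eval_loss, gap_threshold=0.3):
    """Flag likely overfitting: train loss keeps falling while eval loss turns up."""
    gap = eval_loss[-1] - train_loss[-1]
    eval_rising = len(eval_loss) >= 2 and eval_loss[-1] > min(eval_loss)
    return gap > gap_threshold and eval_rising

# Healthy run: both losses fall together.
print(overfit_signal([1.2, 0.9, 0.7], [1.3, 1.0, 0.8]))  # False
# Overfit run: train loss keeps dropping, eval loss bottoms out and rises.
print(overfit_signal([1.2, 0.6, 0.2], [1.1, 0.9, 1.0]))  # True
```

Tracking this per epoch makes "fewer epochs" an early stop rather than a post-mortem fix.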
Ignoring Catastrophic Forgetting
Fine-tuning can degrade general model capabilities. If your fine-tuned model starts producing worse results on tasks outside your training domain, you've overtrained. Monitor general benchmarks alongside task-specific metrics.
Not Comparing Against RAG First
Always benchmark RAG performance before investing in fine-tuning. In many cases, a well-built RAG system with good retrieval achieves 90-95% of fine-tuning quality at a fraction of the cost and complexity. Only fine-tune if RAG demonstrably can't meet your quality requirements.
Fine-Tuning When Prompting Would Work
Sometimes the problem isn't that the model lacks knowledge. It's that the prompt isn't structured well. Before fine-tuning, try systematic prompt optimization: few-shot examples, chain-of-thought reasoning, structured output formatting. If prompting alone closes the quality gap, you've saved yourself significant time and money.
Tools and Frameworks
For Self-Hosted Fine-Tuning
- Hugging Face TRL: Most popular library for LoRA/QLoRA fine-tuning
- Axolotl: Higher-level wrapper around TRL with configuration-driven training
- LLaMA-Factory: Focused on LLaMA family models with a web UI
- Unsloth: Optimized for fast LoRA training (2x speedup claims)
For API Fine-Tuning
- OpenAI Fine-Tuning API: Supports GPT-4o-mini and GPT-4o
- Google Vertex AI: Supports Gemini model fine-tuning
- Together AI: Supports various open-source model fine-tuning
For Evaluation
- Ragas: RAG-specific evaluation framework
- DeepEval: General LLM evaluation framework
- Weights & Biases: Experiment tracking and comparison
- Promptfoo: Prompt and model comparison testing