The "RAG vs fine-tuning" debate has run on for three years. It rests on a category error: they answer different questions. RAG is about what the model knows. Fine-tuning is about how the model behaves. The teams that ship reliably treat them as complementary tools, not alternatives.
TL;DR — Pick by Question Type
| Need | Tool |
|---|---|
| New facts the model doesn't know | RAG |
| Stable, structured output format | Fine-tuning |
| Domain-specific style or tone | Fine-tuning |
| Frequently-updated knowledge | RAG |
| Reduce token cost of a long system prompt | Fine-tuning |
| Classification, routing, tagging | Fine-tuning (or a small purpose-built model) |
| Citation requirement | RAG |
What Each One Actually Does
RAG (Retrieval-Augmented Generation) keeps the model unchanged and supplies relevant facts at inference time through a vector store or hybrid search. Knowledge is external; the model is a reasoner over retrieved evidence.
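The retrieval half of that sentence can be shown in a few lines. This is a toy sketch, not a production pipeline: it uses bag-of-words cosine similarity as a stand-in for a real embedding model and vector store, and the documents and query are invented for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: bag-of-words token counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Refunds are processed within 5 business days.",
    "Enterprise plans include SSO and audit logs.",
    "The API rate limit is 100 requests per minute.",
]
evidence = retrieve("how fast are refunds processed", docs, k=1)

# The model stays unchanged; the facts arrive in the prompt at inference time.
prompt = f"Answer using only this evidence:\n{evidence[0]}\nQ: How fast are refunds?"
```

The key property is in the last line: the knowledge lives outside the model, so updating a document updates every future answer with no retraining.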
Fine-tuning adjusts the model's weights (or adds an adapter) so that the model's default behavior shifts. The model learns patterns — formats, decision rules, vocabularies — that become part of "how it thinks."
Stated as a decision rule: if your problem is "the model doesn't know X," RAG. If your problem is "the model doesn't act like Y," fine-tune.
The Workload Signatures
Recognizing the workload pattern is most of the decision.
RAG-shaped workloads
- "The model needs to answer questions about our docs."
- "The model needs to cite where its answer came from."
- "The knowledge changes — every week, month, or quarter."
- "Different users see different facts (multi-tenant data)."
- "We need to audit which sources informed each answer."
Fine-tuning-shaped workloads
- "We need every response in this strict JSON schema."
- "The tone should sound like our brand voice."
- "Classify support tickets into our 47 internal categories."
- "Routing decisions where a tiny model is enough if it learns our patterns."
- "We're paying for 8,000 tokens of few-shot examples on every request and want to bake them in."
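The last bullet is concrete enough to sketch. Below is what a single training record for that workload might look like, using the common chat-style JSONL convention; the field names follow the OpenAI-style format and the ticket categories are invented, so adapt both to your provider and taxonomy.

```python
import json

# One record of a fine-tuning dataset in chat-style JSONL format
# (field names follow the OpenAI convention; categories are hypothetical).
record = {
    "messages": [
        {"role": "system", "content": "Classify the ticket. Reply with JSON only."},
        {"role": "user", "content": "My invoice shows the wrong billing address."},
        {"role": "assistant", "content": json.dumps(
            {"category": "billing", "priority": "normal", "route_to": "finance"}
        )},
    ]
}

# A few hundred lines like this replace the few-shot examples
# you would otherwise pay for on every request.
line = json.dumps(record)
```

Each record demonstrates the behavior once at training time instead of re-demonstrating it in every prompt.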
Hybrid workloads (most of them)
Real production systems usually need both:
- Customer support agent: RAG for product docs, fine-tune for tone and ticket-formatting.
- Legal review assistant: RAG for relevant case law and contract clauses, fine-tune for the firm's clause taxonomy.
- Internal search and Q&A: RAG for documents, fine-tune the small intent classifier that routes queries.
- Sales SDR: RAG for prospect company data, fine-tune for the email writing style.
The Trap of Fine-Tuning for Knowledge
The most common mistake we see: fine-tuning a model on a knowledge base.
The pitch sounds plausible: "we have 10,000 internal documents; let's bake them into the model so it can answer questions about them without retrieval." It does not work.
Three reasons:
- Knowledge becomes opaque. A fine-tuned model doesn't cite its sources. You cannot audit which document drove which answer.
- Knowledge becomes stale immediately. Re-fine-tuning every time docs change is impractical and expensive.
- Knowledge becomes hallucinable. A fine-tuned-on-facts model still hallucinates — and now with more confidence and no retrievable source to ground it.
The correct shape: keep documents in retrieval, fine-tune only the behavior (citation format, response structure, tone).
The Trap of RAG for Behavior
The mirror mistake: trying to enforce strict output format through prompt engineering and retrieval, when fine-tuning would solve it once.
A team paying for 6,000 tokens of system prompt and few-shot examples on every request to get strict JSON output is paying interest forever. A LoRA fine-tune on 500 examples costs $20-$200 once and removes that 6,000-token tax permanently.
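The "interest forever" claim is easy to check with back-of-envelope arithmetic. The token price, request volume, and training cost below are illustrative assumptions, not quoted rates:

```python
# Break-even for replacing a 6,000-token prompt tax with a one-time
# LoRA fine-tune. All numbers are illustrative assumptions.
prompt_tokens_saved = 6_000
price_per_million_input = 3.00   # $ per 1M input tokens (assumed)
requests_per_day = 10_000        # assumed volume
finetune_cost = 200.00           # one-time training cost (assumed)

daily_tax = prompt_tokens_saved * requests_per_day * price_per_million_input / 1_000_000
breakeven_days = finetune_cost / daily_tax

print(f"daily prompt tax: ${daily_tax:.2f}")        # $180.00 per day
print(f"break-even after {breakeven_days:.1f} days")  # ~1.1 days
```

At any meaningful volume the fine-tune pays for itself in days, which is why the token tax compounds so badly against prompt-only solutions.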
Symptoms of the "RAG for behavior" anti-pattern:
- System prompts longer than 4,000 tokens.
- Prompt files with twenty examples just to get the format right.
- Frequent format failures requiring retry logic.
- Output post-processing that wouldn't be needed if the model just did it correctly.
Fine-tune.
Numbers, Honestly
A small LoRA fine-tune on a 70B-parameter open-weight model in mid-2026:
| Resource | Typical |
|---|---|
| Training data | 200-5,000 examples |
| Training time on 8× H100 | 2-8 hours |
| Cost (managed providers) | $20-$2,000 |
| Adapter size | 50-500 MB |
| Inference overhead vs base | <2% |
A small open-weight model (3-8B) fine-tuned for a narrow task often beats a frontier model with extensive prompting on the same task — at 1/50th the inference cost. The economics push toward fine-tuning whenever the task is stable enough to justify it.
The Combined Pattern
The architecture we deploy when both apply:
```
User → Intent classifier (fine-tuned 3B model) → Route
                        │
        ┌───────────────┼───────────────┐
        │               │               │
  Simple-Q route    RAG route     Tool-use route
  (fine-tuned 8B) (RAG + frontier) (frontier + tools)
        │               │               │
        │          RAG retrieves       │
        └───────────────┼───────────────┘
                        ▼
          Fine-tuned format adapter
          (citation style, JSON schema)
                        │
                        ▼
                    Response
```
The intent classifier and the format adapter are fine-tuned. The knowledge source is RAG. The reasoning is supplied by a frontier model that does not need to be fine-tuned at all.
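The control flow of that architecture fits in a short sketch. Every function below is a placeholder: in production, `classify_intent` would call the fine-tuned 3B classifier, the route handlers would call the fine-tuned 8B model, the RAG pipeline, and the frontier model, and `format_adapter` would be the fine-tuned format adapter. The keyword heuristics exist only to make the sketch runnable.

```python
from typing import Callable

def classify_intent(query: str) -> str:
    # Placeholder for the fine-tuned 3B intent classifier.
    if "docs" in query or "policy" in query:
        return "rag"
    if any(w in query for w in ("run", "execute", "book")):
        return "tools"
    return "simple"

# Placeholders for the three route handlers in the diagram.
ROUTES: dict[str, Callable[[str], str]] = {
    "simple": lambda q: f"[fine-tuned 8B] {q}",
    "rag":    lambda q: f"[frontier + retrieved evidence] {q}",
    "tools":  lambda q: f"[frontier + tools] {q}",
}

def format_adapter(raw: str) -> str:
    # Placeholder for the fine-tuned format adapter: enforce output shape.
    return f'{{"answer": "{raw}"}}'

def answer(query: str) -> str:
    route = classify_intent(query)
    return format_adapter(ROUTES[route](query))
```

Note that the fine-tuned components sit only at the edges (routing in, formatting out); the knowledge and the heavy reasoning stay in retrieval and the frontier model.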
How to Decide, Mechanically
A two-question funnel:
Q1: Does the answer to this kind of query depend on facts that change over time, vary by user, or need to be cited?
- Yes → RAG is mandatory.
Q2: Does the response need to follow a strict format, tone, or decision rule that prompting alone fails to enforce reliably?
- Yes → Fine-tune on top of the RAG output.
- No → RAG with prompt engineering is enough.
If you answered "no" to Q1 and "yes" to Q2, you have a pure fine-tuning workload — rare in business contexts, common in classification and tagging.
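The funnel is mechanical enough to encode directly. This is a literal transcription of the two questions above, nothing more:

```python
def choose_approach(facts_change_or_need_citation: bool,
                    strict_behavior_needed: bool) -> str:
    # Q1: do answers depend on changing, per-user, or citable facts?
    # Q2: does the output need strict format/tone/decision rules
    #     that prompting fails to enforce reliably?
    if facts_change_or_need_citation and strict_behavior_needed:
        return "RAG + fine-tuned adapter"
    if facts_change_or_need_citation:
        return "RAG + prompt engineering"
    if strict_behavior_needed:
        return "pure fine-tuning (classification/tagging shape)"
    return "plain prompting is likely enough"
```

Feeding a workload through this function forces the "which tool" conversation to start from the question type rather than from tooling preferences.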
Common Misconceptions
- "Fine-tuning is expensive." Not in 2026. LoRA adapters cost tens to hundreds of dollars to train. The expensive part is curating the dataset.
- "RAG is slow." Not when implemented properly. A well-built RAG pipeline runs end-to-end in under 1.5 seconds (see the edge inference post).
- "You can fine-tune your way to factual accuracy." No. The model learns patterns, not facts. Use retrieval for facts.
- "You only need RAG, never fine-tuning." Wrong in any product where output format, tone, or routing accuracy materially affect UX or cost.
Frequently Asked Questions
Should I use RAG or fine-tune a model?
Most production needs are RAG. Fine-tuning is correct when you need to change the model's style, format, or behavior — not when you need to give it new facts. The two are complementary, not alternatives.
What does fine-tuning actually change?
Fine-tuning shifts the model's behavior toward patterns in the training data. It is excellent for teaching format, style, tone, and decision rules. It is poor for teaching facts, because facts retrieved at inference time are more reliable than facts baked into weights.
Is LoRA fine-tuning production-ready in 2026?
Yes. LoRA and QLoRA are standard for adapting open-weight models. A LoRA adapter on Llama 3.3 70B trained on 500-5,000 high-quality examples typically reaches its quality ceiling, and managed providers (Mistral, OpenAI, Bedrock) offer comparable adapter training services.
How much data do I need to fine-tune?
For style/format/tone: 200-2,000 examples is usually enough. For decision rules and new task formats: 1,000-10,000. For new domain knowledge: fine-tuning is the wrong tool — use RAG instead.
Can I fine-tune a model on top of RAG?
Yes, and this is the most common production pattern. The model learns how to use retrieved evidence; the retrieval supplies what to use.
Key Takeaways
- RAG and fine-tuning solve different problems — facts versus style — and combine well.
- Fine-tune for behavior; retrieve for knowledge.
- Frequently-updated domains belong in RAG; stable patterns belong in fine-tuning.
- Most production teams misallocate both: they fine-tune where they should retrieve (knowledge) and prompt where they should fine-tune (behavior).
- The two-question funnel — does it depend on changing facts? does it need strict behavior? — answers the choice for most workloads.