Hallucinations are not a bug you can prompt your way out of. They are the predictable output of a probabilistic generator that has been handed insufficient or contradictory evidence. The fix is architectural — seven independent layers, each catching a different failure mode.
TL;DR — The 30-Second Answer
A production RAG system that targets a sub-1% hallucination rate cannot rely on any single mechanism. The reliable pattern is defense in depth: query rewriting, hybrid retrieval, reranking, an evidence sufficiency check, grounded generation, post-generation verification, and continuous evaluation. Each layer typically eliminates 40–70% of the residual errors from the previous one. Stack seven layers and you arrive at production-grade groundedness.
What Hallucinations Actually Look Like in Production
Not every wrong answer is a hallucination, and getting precise about the failure mode matters because each one needs a different defense.
| Failure mode | Example | Root cause |
|---|---|---|
| Intrinsic hallucination | Answer contradicts retrieved passage | Generation step ignoring context |
| Extrinsic hallucination | Answer adds facts not in any passage | Model extrapolating from training data |
| Citation fabrication | Invents a passage ID or URL that doesn't exist | Generation untethered from real retrieval set |
| Conflation | Merges two passages into one false claim | No clear attribution per claim |
| Overconfidence under no-evidence | Confident answer when retrieval returned nothing relevant | No refusal policy |
| Out-of-corpus inference | Answers a question the corpus cannot answer | No corpus-coverage check |
The defenses below map directly to these failure modes.
Layer 1 — Query Rewriting and Decomposition
The first hallucination defense happens before retrieval even starts. A vague query produces a noisy top-k, and a noisy top-k makes grounded generation harder.
```python
async def rewrite_query(user_query: str, history: list[Message]) -> RewriteResult:
    """Resolve coreferences, expand acronyms, split multi-hop questions."""
    prompt = build_rewrite_prompt(user_query, history)
    out = await rewriter_llm.complete(prompt, response_model=RewriteResult)
    # Multi-hop questions become multiple retrievals
    if out.is_multi_hop:
        out.subqueries = decompose(user_query)
    # Untargeted queries get filtered at this gate
    if out.intent in ("small_talk", "unanswerable"):
        out.skip_retrieval = True
    return out
```
Target metric: 90%+ of queries should rewrite to a single, unambiguous question or a clear set of sub-questions. Measure with a sampled human eval; a 200-row evaluation set is plenty.
Layer 2 — Hybrid Retrieval
Pure vector retrieval misses exact-match cases (SKUs, dates, named entities). Pure keyword retrieval misses paraphrases. Hybrid is the standard.
| Retrieval style | Recall@10 on FinanceQA | Latency overhead |
|---|---|---|
| Dense only (BGE-large) | 71% | Baseline |
| BM25 only | 64% | -10ms |
| Hybrid (RRF fusion) | 87% | +25ms |
| Hybrid + reranker | 94% | +120ms |
The hybrid combination is non-negotiable for production. The reranker is non-negotiable if long-tail queries matter (they always do).
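A minimal sketch of the RRF fusion step, assuming each retriever returns an ordered list of document ids (the constant `k = 60` is the conventional default, not a value tuned on these benchmarks):

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge ranked doc-id lists from dense and BM25 retrieval.

    A document's fused score is the sum of 1 / (k + rank) over every list it
    appears in, so anything ranked highly by either retriever floats to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. candidates = rrf_fuse([dense_ids, bm25_ids])[:50]  # hypothetical retriever outputs, then hand to the reranker
```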
Layer 3 — Reranking with Cross-Encoders
Bi-encoders (vector search) score documents independently. Cross-encoders score each (query, document) pair jointly and usually recover the relevant documents that the bi-encoder ranks too low.
Cost-effective rerankers in mid-2026:
- BGE-reranker-v2-m3 — open source, runs on a single L4, ~80ms for 50 candidates
- Cohere Rerank 3 — managed API, ~150ms, highest accuracy on heterogeneous corpora
- ColBERTv2 / PLAID — best latency/quality trade-off for long documents
The reranker also gives you a calibrated relevance score, which feeds directly into the next layer.
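A minimal sketch of the reranking step using the `sentence-transformers` `CrossEncoder` wrapper with the BGE reranker listed above; the model name and `top_k` are illustrative choices, not a prescribed setup:

```python
from sentence_transformers import CrossEncoder

# Assumed model choice; any cross-encoder reranker follows the same pattern.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, docs: list[str], top_k: int = 5) -> list[tuple[str, float]]:
    """Score each (query, doc) pair jointly and return the top_k docs with scores."""
    pairs = [(query, doc) for doc in docs]
    # predict() returns one relevance score per pair; single-label rerankers
    # typically sigmoid-normalize it to [0, 1], which is what Layer 4 thresholds.
    scores = reranker.predict(pairs)
    ranked = sorted(zip(docs, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]
```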
Layer 4 — Evidence Sufficiency Check (The Gate)
This is the most under-implemented and most impactful layer. Before generation, ask: is the retrieved evidence sufficient to answer the question?
```python
from statistics import mean

def evidence_sufficient(query: str, hits: list[Hit]) -> Decision:
    max_score = max(h.rerank_score for h in hits)
    top_3_avg = mean(h.rerank_score for h in hits[:3])
    if max_score < 0.25:
        return Decision.REFUSE   # corpus does not contain the answer
    if top_3_avg < 0.40:
        return Decision.CLARIFY  # weak coverage, ask user to refine
    return Decision.PROCEED
```
A well-calibrated threshold here delivers two wins: it eliminates the "confident answer to an unanswerable question" failure mode, and it dramatically improves user trust because the assistant gracefully says "I don't have enough information in our docs to answer that" instead of confabulating.
Target metric: refusal precision over 95% — when the system refuses, it should be right to refuse more than 95% of the time.
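A sketch of how the gate might sit in the request path; `hybrid_retrieve` and `grounded_generate` are placeholders for Layers 2–3 and 5, and the clarification message is illustrative:

```python
async def answer(query: str) -> str:
    hits = await hybrid_retrieve(query)          # Layers 2-3: hybrid retrieval + rerank
    decision = evidence_sufficient(query, hits)  # Layer 4 gate

    if decision is Decision.REFUSE:
        return "I don't know based on the provided sources."
    if decision is Decision.CLARIFY:
        return "Could you narrow the question? I found only weak matches in our docs."

    return await grounded_generate(query, hits)  # Layer 5: cited, grounded answer
```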
Layer 5 — Grounded Generation with Citation Constraints
The generation prompt itself is a defense layer. Three properties of a hallucination-resistant prompt:
- Citation requirement per claim. Every factual sentence must end with a passage ID. The model is penalized for unsupported sentences.
- Hard refusal clause. "If the passages do not contain the answer, respond exactly with: I don't know based on the provided sources."
- Quote-or-paraphrase discipline. For numeric claims, the model must quote verbatim or include a unit.
```text
You are answering using ONLY the passages between <passages> tags.

RULES:
1. Every factual sentence must end with [P#] where # is the passage id.
2. If the passages do not contain the answer, respond exactly:
   "I don't know based on the provided sources."
3. For numeric claims, quote the number verbatim.
4. Never combine information from passages into novel claims not in any single passage.

<passages>
[P1] {passage_1}
[P2] {passage_2}
...
</passages>

Question: {question}
```
This prompt, with the rerank-score gate from Layer 4, eliminates roughly half of remaining hallucinations on standard benchmarks.
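As a rough sketch of how the template gets filled, assuming each reranked hit exposes its passage text and `GROUNDING_RULES` is a constant holding the rules block above:

```python
def build_grounded_prompt(question: str, hits: list[Hit]) -> str:
    """Assemble the Layer 5 prompt: rules, [P#]-tagged passages, then the question."""
    # Passage ids are assigned in rank order, [P1] being the top reranked hit.
    passages = "\n".join(f"[P{i}] {hit.text}" for i, hit in enumerate(hits, start=1))
    return f"{GROUNDING_RULES}\n\n<passages>\n{passages}\n</passages>\n\nQuestion: {question}"
```

Keeping the [P#] ids in rank order also lets the Layer 6 verifier map a cited passage id straight back to a retrieved hit.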
Layer 6 — Post-Generation Verification
The model has produced an answer with citations. Now verify the citations actually support the claims. This is cheap because it runs only after generation and can be sampled.
```python
async def verify_groundedness(answer: str, passages: list[Passage]) -> Verification:
    claims = await extract_atomic_claims(answer)
    results = []
    for claim in claims:
        supported = await check_entailment(
            claim=claim,
            evidence=passages[claim.cited_passage_id],
        )
        results.append((claim, supported))
    score = sum(1 for _, s in results if s) / max(len(results), 1)
    return Verification(score=score, unsupported=[c for c, s in results if not s])
```
Tactics that work well at this layer:
- Atomic claim extraction. Split the answer into single-fact claims; verify each independently.
- NLI-based entailment. A small DeBERTa-style NLI model on each (claim, passage) pair is ~10x cheaper than another LLM call.
- Strict mode. If groundedness < 0.85, regenerate with a tighter prompt or escalate to refusal.
Target metric: groundedness score ≥ 0.95 across a rolling 7-day window.
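To make the NLI tactic concrete, here is a minimal sketch using a Hugging Face cross-encoder NLI model; the model name, label strings, and threshold are assumptions to verify against whatever model you deploy:

```python
from transformers import pipeline

# Assumed model choice; any (premise, hypothesis) -> entailment/neutral/contradiction
# classifier fits here and is far cheaper than a second LLM call.
nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-base")

def claim_supported(claim: str, passage: str, threshold: float = 0.8) -> bool:
    """True if the cited passage entails the claim with enough confidence."""
    # The passage is the premise, the extracted atomic claim is the hypothesis.
    scores = nli({"text": passage, "text_pair": claim}, top_k=None)
    entailment = next(s["score"] for s in scores if s["label"].lower() == "entailment")
    return entailment >= threshold
```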
Layer 7 — Continuous Evaluation on Production Traffic
The first six layers handle the request-time defense. The seventh closes the loop: continuously measure what's actually shipping.
| Signal | What it tells you | Frequency |
|---|---|---|
| LLM-as-judge groundedness on 3-5% sample | Live hallucination rate | Continuous |
| User thumbs-down feedback rate | Perceived correctness | Continuous |
| Corpus drift detection (new docs vs index) | When to re-embed | Daily |
| Retrieval recall on golden set | Whether retrieval still works | Weekly |
| Refusal rate by topic | Coverage gaps | Weekly |
A spike in refusal rate is usually the first sign of a coverage gap in the corpus. A spike in user thumbs-down without a corresponding refusal-rate move is usually a generation regression — often caused by a silent provider model update.
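A sketch of the sampled LLM-as-judge signal from the table above; the sampling rate mirrors the 3–5% figure, and `build_judge_prompt`, `judge_llm`, `JudgeVerdict`, and `metrics` are assumed stand-ins for your own judge prompt, model client, schema, and metrics sink:

```python
import random

SAMPLE_RATE = 0.04  # judge roughly 3-5% of production traffic

async def maybe_judge(query: str, answer: str, passages: list[str]) -> None:
    """Score a sampled response for groundedness, off the request path."""
    if random.random() > SAMPLE_RATE:
        return
    prompt = build_judge_prompt(query, answer, passages)
    verdict = await judge_llm.complete(prompt, response_model=JudgeVerdict)
    metrics.log("groundedness", verdict.score, tags={"grounded": verdict.is_grounded})
```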
The Numbers, End to End
Production benchmarks, aggregated across the four Synthara-built RAG systems whose stats we have permission to share:
| Layer added | Hallucination rate | Latency impact |
|---|---|---|
| Vanilla RAG (top-5 dense, no reranker) | ~12% | Baseline |
| + Hybrid retrieval | ~8% | +25ms |
| + Reranker | ~5% | +120ms |
| + Sufficiency gate / refusal | ~3% | +5ms |
| + Grounded generation prompt | ~1.5% | 0ms |
| + Post-generation verification | ~0.6% | +200ms (async) |
| + Continuous eval feedback | ~0.3% | 0ms |
The compounding effect is the point. Each layer in isolation looks like a marginal improvement; together they take you from "tolerable demo" to "shippable in regulated industries."
Anti-Patterns We See Repeatedly
- "Just use a bigger model." Hallucinations on RAG tasks are largely insensitive to model size above 8B parameters. Architecture beats size here.
- "Lower the temperature." Helps a little; does not address out-of-corpus extrapolation, which is the dominant failure mode at temp=0.
- "Add 'do not hallucinate' to the prompt." Inert. The model has no introspective access to whether it is hallucinating.
- "Train your own model." Expensive, slow, and orthogonal to the problem. Fix retrieval and grounding first.
Frequently Asked Questions
What is a hallucination in a RAG system?
In a RAG context, a hallucination is any generated claim that is not supported by the retrieved evidence. This includes invented citations, fabricated numbers, and confident statements that contradict the source documents.
Why does adding RAG not eliminate hallucinations?
Retrieval reduces hallucinations but does not eliminate them. The model can still hallucinate when the retrieved passages are irrelevant, when it blends multiple passages incorrectly, when the question is unanswerable from the corpus, or when it tries to be helpful and extrapolates beyond evidence.
What is the most cost-effective hallucination defense?
An LLM-as-judge grounding check on a 3–5% sample of production traffic plus a hard refusal policy when retrieval confidence is below a threshold. This combination catches roughly 80% of hallucinations at under 5% of generation cost.
How do you measure hallucination rate?
Two metrics matter: attribution score — percentage of factual claims in the response that map to a specific retrieved passage; and groundedness — whether the response would still be true if the retrieved passages were the only source of truth. Both are measurable automatically with an LLM-as-judge pipeline.
Should the system ever refuse to answer?
Yes. Refusal is the single most-undervalued mechanism in production RAG. A confident "I don't know based on the provided sources" preserves trust; a confident wrong answer destroys it.
Key Takeaways
- Hallucinations are architectural, not prompt-engineering, problems.
- Defense in depth with seven independent layers drives production hallucination rates from ~12% to under 0.5%.
- Refusal is a feature — design for it explicitly.
- Continuous evaluation on production traffic is the only sustainable groundedness control.
- Bigger models do not fix hallucinations; better retrieval and grounding do.
