Blog · 2026-05-17
Six grounding strategies that actually reduce LLM hallucination (and the trade-offs)
Prompt engineering buys 10-30%. Retrieval-augmented generation buys another 20-40%. Signed-claim verification closes the long tail. Six strategies, their measured impact, and when to combine them.
You ship a chatbot. Users find a single hallucinated fact and screenshot it on Twitter. Your trust signal collapses. The features you spent six months building become irrelevant.
Frontier models in 2026 hallucinate ~1-5% on well-trodden questions and ~15-40% on long-tail technical queries. Reducing that rate isn't a single fix; it's a stack of mitigations layered on top of each other. Here are the six strategies that actually work, ordered from cheapest to most effective.
1. Temperature 0 + clear system prompt (10-15% reduction)
The cheapest win. Set temperature=0 for any task that involves factual recall (not creative writing). Add a system prompt that explicitly tells the model not to invent:
You are a precise assistant. If you don't know an answer, say "I don't know."
Never invent dates, parameter counts, paper authors, or citations.
If the user asks for a specific fact you're uncertain about, decline.

This doesn't eliminate hallucinations, but it suppresses the obvious ones where the model would otherwise confabulate fluently. Free. Five minutes to ship.
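As a minimal sketch, the whole strategy is a request builder that pins temperature to zero and prepends the system prompt. The OpenAI-style payload shape and the placeholder model name are assumptions; adapt the dict to your provider's SDK.

```python
# Build a deterministic, grounded chat request. The payload shape follows
# the common OpenAI-style messages format (an assumption; adapt as needed).
GROUNDED_SYSTEM_PROMPT = (
    'You are a precise assistant. If you don\'t know an answer, say "I don\'t know."\n'
    "Never invent dates, parameter counts, paper authors, or citations.\n"
    "If the user asks for a specific fact you're uncertain about, decline."
)

def build_factual_request(user_question: str, model: str = "your-model-here") -> dict:
    """Return a request payload with temperature pinned to 0 for factual recall."""
    return {
        "model": model,
        "temperature": 0,  # deterministic decoding: no sampling noise on facts
        "messages": [
            {"role": "system", "content": GROUNDED_SYSTEM_PROMPT},
            {"role": "user", "content": user_question},
        ],
    }

req = build_factual_request("Who are the authors of the Transformer paper?")
```

Passing this dict to your client is the entire change; nothing downstream needs to know.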
2. Few-shot examples (15-25% reduction on extraction tasks)
For structured-output tasks (extract dates, names, prices), show the model 3-5 examples of correct extraction before asking it to do the real task. Few-shot prompting beats zero-shot for narrow factual extraction by 15-25% in our experience.
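A few-shot prompt is just interleaved user/assistant turns showing correct extractions before the real input. A sketch, with made-up example pairs (the invoice texts and JSON shape are illustrative, not from a real dataset):

```python
import json

# Three worked examples of the extraction we want. Hypothetical data.
FEW_SHOT_EXAMPLES = [
    ("Invoice dated 2021-03-04 for $1,200.", {"date": "2021-03-04", "amount": "1200"}),
    ("Paid $87.50 on 2023-11-02.", {"date": "2023-11-02", "amount": "87.50"}),
    ("Receipt: $15 on 2020-01-09.", {"date": "2020-01-09", "amount": "15"}),
]

def build_few_shot_messages(task_text: str) -> list:
    """Interleave user/assistant turns so the model sees correct extractions first."""
    messages = [{"role": "system",
                 "content": "Extract the date (ISO 8601) and amount as JSON."}]
    for source, extraction in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": source})
        messages.append({"role": "assistant", "content": json.dumps(extraction)})
    messages.append({"role": "user", "content": task_text})  # the real task last
    return messages

msgs = build_few_shot_messages("Charged $42 on 2024-06-01.")
```

The assistant turns act as demonstrations; the model's continuation of the pattern is the extraction.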
Best for: information extraction, classification, formatting. Not effective for: open-ended generation, citations, summarization.
3. Retrieval-augmented generation (20-40% reduction)
The dominant strategy. Embed your knowledge base, retrieve top-K relevant chunks at query time, splice into the prompt:
SYSTEM: Answer using ONLY the context below. If the context doesn't
cover the question, say so. Always cite which context block you used.
CONTEXT:
[chunk 1]: Llama 3.1 was released on July 23, 2024. Three variants:
8B, 70B, 405B. Context window: 128k tokens.
[chunk 2]: ...
USER: When did Llama 3.1 come out?

Catches roughly 60% of fabricated-source hallucinations. The remaining 30-40% gap is what we'll address next.
Frameworks: LangChain, LlamaIndex, Haystack. Vector DBs: FAISS, Pinecone, Weaviate, Qdrant, Chroma, pgvector.
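End to end, the pattern is embed, rank, splice. Here's a self-contained sketch using a toy bag-of-words similarity in place of a real embedding model (every function here is hypothetical scaffolding; in production the `embed` step would call an embedding API and the corpus would live in one of the vector DBs above):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'. Swap for a real embedding model in production."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, corpus: list, k: int = 2) -> list:
    """Rank chunks by similarity to the query; return the top-K."""
    q = embed(query)
    return sorted(corpus, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_rag_prompt(query: str, corpus: list, k: int = 2) -> str:
    """Splice the retrieved chunks into a grounded prompt."""
    chunks = retrieve(query, corpus, k)
    context = "\n".join(f"[chunk {i+1}]: {c}" for i, c in enumerate(chunks))
    return (
        "Answer using ONLY the context below. If the context doesn't cover the\n"
        "question, say so. Always cite which context block you used.\n"
        f"CONTEXT:\n{context}\nUSER: {query}"
    )

corpus = [
    "Llama 3.1 was released on July 23, 2024. Variants: 8B, 70B, 405B.",
    "The capital of France is Paris.",
    "Llama 3.1 has a 128k-token context window.",
]
prompt = build_rag_prompt("When did Llama 3.1 come out?", corpus)
```

The irrelevant chunk never reaches the model, which is the entire mechanism: the model can only confabulate around what's in its context.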
4. Citation requirement + post-hoc check (closes ~50% of RAG's residual gap)
The classic RAG failure: the retriever pulls the right document, but the model still emits the wrong number from the page. The fix: force the model to emit inline citations, then post-process to check they match.
SYSTEM: For every factual claim in your response, append [^N] where N
matches the context block. Citations must be verifiable against the
context. If you cannot cite a claim, mark it [^unverified].

Post-process: scan the response for [^N] markers and verify each citation actually appears in the retrieved context. Strip or flag unverified claims before returning the response to the user.
Costs: latency for the post-process pass + ~10% prompt overhead from citation instructions. Catches: most of the right-doc-wrong-number cases.
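The post-process pass is a small amount of regex work. A minimal sketch that checks only that each cited block number exists and flags [^unverified] marks; a production version would additionally verify the claim text against the cited block:

```python
import re

# Matches [^1], [^2], ... and the explicit [^unverified] marker.
CITATION_RE = re.compile(r"\[\^(\d+|unverified)\]")

def check_citations(response: str, context_blocks: list) -> list:
    """Return a list of problems found in the response's citations."""
    problems = []
    for match in CITATION_RE.finditer(response):
        ref = match.group(1)
        if ref == "unverified":
            problems.append("unverified claim present")
        elif not (1 <= int(ref) <= len(context_blocks)):
            problems.append(f"citation [^{ref}] points at a nonexistent block")
    return problems

blocks = ["Llama 3.1 was released on July 23, 2024."]
ok = check_citations("Llama 3.1 shipped on July 23, 2024.[^1]", blocks)
bad = check_citations("It has 7T parameters.[^3] Price unknown.[^unverified]", blocks)
```

Anything in the `problems` list gets stripped or flagged before the response reaches the user.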
5. Signed-claim verification (catches the long tail)
Even with RAG + citations, two failure modes remain:
- Out-of-corpus assertions. The model claims something not in your retrieved context. RAG can't verify because the assertion has no source to check against.
- Fabricated citations. The model writes [^1] but block [1] doesn't exist or doesn't support the claim.
The fix: query a separate verified-claim catalog post-generation. Extract atomic assertions from the response and look each one up against a source of truth. We built SourceScore VERITAS for this — 206 hand-verified AI/ML claims with primary sources + HMAC signatures. Free tier, no signup. ~80ms per claim. Catches ~30% of RAG's residual hallucination gap.
For other verticals (non-AI/ML), Wikipedia + Wolfram Alpha + curated domain-specific knowledge bases play the same role. The pattern is what matters: a verification layer after generation that catches what RAG misses.
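The signature check itself is plain HMAC over a canonicalized claim record. A sketch of the mechanism; the record shape, demo key, and placeholder source URL are assumptions, not the actual VERITAS format:

```python
import hashlib
import hmac
import json

SHARED_KEY = b"demo-key"  # placeholder; in practice the catalog publisher's key

def sign_claim(claim: dict, key: bytes = SHARED_KEY) -> str:
    """HMAC-SHA256 over a canonical (sorted-keys) JSON encoding of the record."""
    payload = json.dumps(claim, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_claim(claim: dict, signature: str, key: bytes = SHARED_KEY) -> bool:
    """Constant-time comparison so the check can't leak via timing."""
    return hmac.compare_digest(sign_claim(claim, key), signature)

record = {
    "claim": "Llama 3.1 was released on July 23, 2024.",
    "source": "primary-source-url-here",  # placeholder
}
sig = sign_claim(record)
tampered = dict(record, claim="Llama 3.1 was released in 2023.")
```

The signature binds claim text to source, so a catalog entry can't be silently edited between verification and display.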
6. Constrained decoding (last resort for high-stakes outputs)
For outputs that must conform to a specific schema (JSON, BNF grammar), use a library like Instructor, Pydantic AI, Outlines, or vendor JSON-mode APIs. Constrained decoding guarantees the output fits the schema (zero schema-violation errors) but doesn't guarantee semantic correctness — a hallucinated date still type-checks.
Use constrained decoding for: schema enforcement, downstream system contracts. Combine with strategies 3-5 for factual correctness.
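The gap between schema-valid and factually correct is easy to demonstrate. A stdlib-only sketch (the schema and function name are hypothetical; a real system would use one of the libraries above):

```python
import json
from datetime import date

def validate_release_schema(raw: str) -> dict:
    """Enforce the shape {"model": str, "released": ISO date}; raise on violation."""
    obj = json.loads(raw)
    if set(obj) != {"model", "released"}:
        raise ValueError("unexpected keys")
    if not isinstance(obj["model"], str):
        raise ValueError("model must be a string")
    date.fromisoformat(obj["released"])  # raises ValueError if not a valid date
    return obj

good = validate_release_schema('{"model": "Llama 3.1", "released": "2024-07-23"}')
# A hallucinated date still passes: the schema is satisfied, the fact is wrong.
wrong_fact = validate_release_schema('{"model": "Llama 3.1", "released": "2023-01-01"}')
```

Both payloads validate; only strategies 3-5 can tell them apart. That's why constrained decoding is a contract tool, not a grounding tool.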
The honest stack
Production-grade grounding looks like:
- Temperature 0 + clear system prompt — free baseline
- Few-shot examples for narrow tasks — free
- RAG over your knowledge base — moderate cost
- Citation requirement + post-process check — low cost
- Signed-claim verification on residual claims — ~80ms per claim
- Constrained decoding for schema-required outputs — moderate cost
Combined, you can drive hallucination on AI/ML factual queries from ~30% (raw GPT-4o) down to ~3-5%. That's still not zero — but it's the difference between a chatbot users complain about and one they recommend.
What doesn't work
- "Just use a smarter model." Frontier models hallucinate less than older models, but not close to zero. Capability scaling has not killed hallucination.
- "Fine-tune the model on your factual corpus." Helps for in-corpus questions; the model still hallucinates out-of-corpus. Expensive + maintenance-heavy.
- "Just ask the model if it's sure." Self-reported confidence is poorly calibrated. The model says it's confident about hallucinated answers as often as correct ones.
Practical sequencing
If you're shipping a new AI feature today:
- Week 1: ship strategies 1 + 2 (temperature + system prompt + few-shot if applicable)
- Week 2-3: ship strategy 3 (RAG over your corpus)
- Week 4: ship strategy 4 (citation requirement + post-process)
- Week 5+: add strategy 5 (verification layer) for residual claims
- Whenever needed: strategy 6 (constrained decoding) for schema-bound outputs
Skipping straight to strategy 5 without 1-4 is a mistake — verification is most cost-effective on the long tail of claims, not the bulk. RAG + citations gets you most of the way; verification closes the last gap.