SourceScore

Blog · 2026-05-17

Six grounding strategies that actually reduce LLM hallucination (and the trade-offs)

Prompt engineering buys 10-30%. Retrieval-augmented generation buys another 20-40%. Signed-claim verification closes the long tail. Six strategies, their measured impact, and when to combine them.

You ship a chatbot. Users find a single hallucinated fact and screenshot it on Twitter. Your trust signal collapses. The features you spent six months building become irrelevant.

Frontier models in 2026 hallucinate ~1-5% on well-trodden questions, ~15-40% on long-tail technical queries. Reducing that rate isn't a single fix; it's a stack of mitigations layered on top of each other. Here are the six strategies that actually work, in order from cheapest to most effective.

1. Temperature 0 + clear system prompt (10-15% reduction)

The cheapest win. Set temperature=0 for any task that involves factual recall (not creative writing). Add a system prompt that explicitly tells the model not to invent:

You are a precise assistant. If you don't know an answer, say "I don't know."
Never invent dates, parameter counts, paper authors, or citations.
If the user asks for a specific fact you're uncertain about, decline.

Doesn't fix hallucinations — but reduces obvious ones where the model would otherwise confabulate fluently. Free. 5 minutes to ship.
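Wired into an API call, it looks something like the sketch below, here using the OpenAI Python SDK (the model name and question are illustrative; adapt to whatever client you use):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a precise assistant. If you don't know an answer, say \"I don't know.\" "
    "Never invent dates, parameter counts, paper authors, or citations. "
    "If the user asks for a specific fact you're uncertain about, decline."
)

response = client.chat.completions.create(
    model="gpt-4o",      # illustrative; any chat model works
    temperature=0,       # greedy decoding for factual recall
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Who are the authors of the Transformer paper?"},
    ],
)
print(response.choices[0].message.content)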

2. Few-shot examples (15-25% reduction on extraction tasks)

For structured-output tasks (extract dates, names, prices), show the model 3-5 examples of correct extraction before asking it to do the real task. Few-shot prompting beats zero-shot for narrow factual extraction by 15-25% in our experience.

Best for: information extraction, classification, formatting. Not effective for: open-ended generation, citation generation, or summarization.
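Mechanically, few-shot just means prepending example user/assistant turns before the real query. A minimal sketch (the example pairs and output format are made up for illustration):

from openai import OpenAI

client = OpenAI()

# Prior turns showing correct extractions; the real query comes last.
few_shot = [
    {"role": "user", "content": "Extract the release date: 'GPT-4 launched on March 14, 2023.'"},
    {"role": "assistant", "content": '{"release_date": "2023-03-14"}'},
    {"role": "user", "content": "Extract the release date: 'Llama 2 came out on July 18, 2023.'"},
    {"role": "assistant", "content": '{"release_date": "2023-07-18"}'},
]

messages = (
    [{"role": "system", "content": "Extract the requested field as JSON. If it is absent, return null."}]
    + few_shot
    + [{"role": "user", "content": "Extract the release date: 'Llama 3.1 was released on July 23, 2024.'"}]
)

response = client.chat.completions.create(model="gpt-4o", temperature=0, messages=messages)
print(response.choices[0].message.content)   # expected: {"release_date": "2024-07-23"}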

3. Retrieval-augmented generation (20-40% reduction)

The dominant strategy. Embed your knowledge base, retrieve top-K relevant chunks at query time, splice into the prompt:

SYSTEM: Answer using ONLY the context below. If the context doesn't
cover the question, say so. Always cite which context block you used.

CONTEXT:
[chunk 1]: Llama 3.1 was released on July 23, 2024. Three variants:
8B, 70B, 405B. Context window: 128k tokens.
[chunk 2]: ...

USER: When did Llama 3.1 come out?

Catches roughly 60% of fabricated-source hallucinations. The remaining 30-40% is what the next two strategies address.

Frameworks: LangChain, LlamaIndex, Haystack. Vector DBs: FAISS, Pinecone, Weaviate, Qdrant, Chroma, pgvector.
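Stripped to the mechanics, the loop is: embed, rank, splice. A sketch using OpenAI embeddings and brute-force cosine similarity; in production you'd swap the in-memory arrays for one of the vector DBs above (the chunks here are illustrative):

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1. Index the knowledge base once.
chunks = [
    "Llama 3.1 was released on July 23, 2024. Three variants: 8B, 70B, 405B. "
    "Context window: 128k tokens.",
    "GPT-4o was announced by OpenAI in May 2024.",
]
chunk_vecs = embed(chunks)

# 2. At query time, rank chunks by cosine similarity and keep the top K.
def retrieve(query: str, k: int = 3) -> list[str]:
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]

# 3. Splice the retrieved chunks into the prompt.
query = "When did Llama 3.1 come out?"
context = "\n".join(f"[chunk {i + 1}]: {c}" for i, c in enumerate(retrieve(query)))
answer = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {"role": "system", "content": "Answer using ONLY the context below. If the context doesn't "
                                      "cover the question, say so. Always cite which context block "
                                      f"you used.\n\nCONTEXT:\n{context}"},
        {"role": "user", "content": query},
    ],
)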

4. Citation requirement + post-hoc check (closes ~50% of RAG's residual gap)

The classic RAG failure: the retriever pulls the right document, but the model still quotes the wrong number from the page. The fix: force the model to emit inline citations, then post-process to check that they match.

SYSTEM: For every factual claim in your response, append [^N] where N
matches the context block. Citations must be verifiable against the
context. If you cannot cite a claim, mark it [^unverified].

Post-process: scan the response for [^N] markers, verify each citation actually appears in the retrieved context. Strip or flag unverified claims before returning to user.
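A sketch of that post-process pass (the marker format follows the prompt above; what you do with flagged claims is a product decision):

import re

def check_citations(response: str, context_blocks: list[str]) -> list[str]:
    """Return a warning for every citation marker that doesn't resolve to a retrieved block."""
    warnings = []
    for match in re.finditer(r"\[\^(\w+)\]", response):
        ref = match.group(1)
        if ref == "unverified":
            warnings.append(f"Model flagged an unverified claim at offset {match.start()}")
        elif not ref.isdigit() or not (1 <= int(ref) <= len(context_blocks)):
            warnings.append(f"Citation [^{ref}] does not point at any retrieved context block")
    return warnings

# Before returning to the user:
#   warnings = check_citations(model_output, retrieved_chunks)
#   if warnings: strip the flagged sentences, or route the response to review.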

Costs: latency for the post-process pass + ~10% prompt overhead from citation instructions. Catches: most of the right-doc-wrong-number cases.

5. Signed-claim verification (catches the long tail)

Even with RAG + citations, two failure modes remain:

  • Out-of-corpus assertions. The model claims something not in your retrieved context. RAG can't verify because the assertion has no source to check against.
  • Fabricated citations. The model writes [^1], but the cited block doesn't exist or doesn't support the claim.

The fix: query a separate verified-claim catalog post-generation. Extract atomic assertions from the response; look each one up against a source of truth. We built SourceScore VERITAS for this — 206 hand-verified AI/ML claims with primary sources + HMAC signatures. Free tier, no signup. ~80ms per claim. Catches ~30% of RAG's residual hallucination gap.

For other verticals (non-AI/ML), Wikipedia + Wolfram Alpha + curated domain-specific knowledge bases play the same role. The pattern is what matters: a verification layer after generation that catches what RAG misses.
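The lookup itself is simple; the hard part is curating and signing the catalog. A generic sketch of the pattern, with a hypothetical in-memory catalog, placeholder source URL, and signing key standing in for a real verification service (this is not the VERITAS API):

import hashlib
import hmac

# Hypothetical in-memory catalog: claim text -> primary source + HMAC signature.
# In practice the catalog is an API and the signing key stays with the verifier.
SIGNING_KEY = b"replace-with-catalog-signing-key"

def sign(claim: str) -> str:
    return hmac.new(SIGNING_KEY, claim.encode(), hashlib.sha256).hexdigest()

CATALOG = {
    "Llama 3.1 was released on July 23, 2024": {
        "source": "https://example.com/llama-3-1-announcement",  # placeholder URL
        "sig": sign("Llama 3.1 was released on July 23, 2024"),
    },
}

def verify_claim(claim: str) -> dict | None:
    """Look up an atomic claim; None means out-of-corpus or a tampered record."""
    entry = CATALOG.get(claim)
    if entry is None:
        return None  # not in the catalog: flag for review rather than trusting the model
    if not hmac.compare_digest(sign(claim), entry["sig"]):
        return None  # signature mismatch: the record was altered
    return entry

# After generation: extract atomic assertions from the response (e.g. with a second
# LLM call), run each through verify_claim, and flag anything that returns None.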

6. Constrained decoding (last resort for high-stakes outputs)

For outputs that must conform to a specific schema (JSON, BNF grammar), use a library like Instructor, Pydantic AI, Outlines, or vendor JSON-mode APIs. Constrained decoding guarantees the output fits the schema (zero schema-violation errors) but doesn't guarantee semantic correctness — a hallucinated date still type-checks.

Use constrained decoding for: schema enforcement, downstream system contracts. Combine with strategies 3-5 for factual correctness.
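For example, vendor JSON mode plus a Pydantic schema gives you the guarantee while making the limitation obvious. A sketch (the schema is illustrative); libraries like Instructor and Outlines wrap the same idea with retries and grammar-level constraints:

from openai import OpenAI
from pydantic import BaseModel, ValidationError

class ModelRelease(BaseModel):
    name: str
    release_date: str           # schema enforcement only: a hallucinated date still validates
    parameter_counts: list[str]

client = OpenAI()
raw = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    response_format={"type": "json_object"},   # vendor JSON mode
    messages=[
        {"role": "system", "content": "Return JSON with keys name, release_date, parameter_counts."},
        {"role": "user", "content": "Summarize the Llama 3.1 release."},
    ],
).choices[0].message.content

try:
    release = ModelRelease.model_validate_json(raw)   # schema check, not a fact check
except ValidationError:
    ...  # retry or fall back; pair with strategies 3-5 for factual correctness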

The honest stack

Production-grade grounding looks like:

  1. Temperature 0 + clear system prompt — free baseline
  2. Few-shot examples for narrow tasks — free
  3. RAG over your knowledge base — moderate cost
  4. Citation requirement + post-process check — low cost
  5. Signed-claim verification on residual claims — ~80ms per claim
  6. Constrained decoding for schema-required outputs — moderate cost

Combined, you can drive hallucination on AI/ML factual queries from ~30% (raw GPT-4o) down to ~3-5%. That's still not zero — but it's the difference between a chatbot users complain about and one they recommend.

What doesn't work

  • "Just use a smarter model." Frontier models hallucinate less than older models, but not close to zero. Capability scaling has not killed hallucination.
  • "Fine-tune the model on your factual corpus." Helps for in-corpus questions; the model still hallucinates out-of-corpus. Expensive + maintenance-heavy.
  • "Just ask the model if it's sure." Self-reported confidence is poorly calibrated. The model says it's confident about hallucinated answers as often as correct ones.

Practical sequencing

If you're shipping a new AI feature today:

  1. Week 1: ship strategies 1 + 2 (temperature + system prompt + few-shot if applicable)
  2. Week 2-3: ship strategy 3 (RAG over your corpus)
  3. Week 4: ship strategy 4 (citation requirement + post-process)
  4. Week 5+: add strategy 5 (verification layer) for residual claims
  5. Whenever needed: strategy 6 (constrained decoding) for schema-bound outputs

Skipping straight to strategy 5 without 1-4 is a mistake — verification is most cost-effective on the long tail of claims, not the bulk. RAG + citations gets you most of the way; verification closes the last gap.
