SourceScore

RAG pipeline verification — close the right-doc-wrong-number gap

Your retriever pulls the right document. Your LLM still emits the wrong number on the page. RAG retrieves; it doesn't verify. Add a verify-then-respond layer to close the gap.

The problem

You built RAG. Embedded your corpus, picked a vector DB, tuned top-K, wrote the prompt template. Production users file tickets:

"It told me the model has 32k context. The source it cited literally says 128k."

You read the source. It says 128k. Your retriever found it. Your prompt included it. The model still hallucinated.

This isn't a retrieval failure. It's a verification failure. RAG = Retrieval-Augmented Generation: there is no built-in step that checks the model's output against the retrieved context, so the inconsistency ships to the user unnoticed.

The pattern: verify-then-respond

Add a third stage to your RAG pipeline:

  1. Retrieve. Pull top-K from your vector DB. Unchanged.
  2. Generate. Model produces a response. Unchanged.
  3. Verify. Extract atomic assertions from the response. Look each up via VERITAS. Annotate verified / unverified / refuted in the user-facing output.

Code (Python, ~30 lines)

import re
import httpx

def verify_assertions(llm_response: str) -> dict:
    # Naive extraction: sentences containing assertion verbs ("is", "has", "released", "introduced")
    sentences = re.split(r'(?<=[.!?])\s+', llm_response)
    candidates = [
        s for s in sentences
        if re.search(r'\b(is|has|released|introduced)\b', s, re.IGNORECASE)
    ]

    verified = []
    unverified = []
    for claim in candidates:
        r = httpx.post(
            'https://sourcescore.org/api/v1/verify',
            json={'claim': claim, 'minConfidence': 0.85},
            timeout=2.0,
        )
        r.raise_for_status()
        result = r.json()
        if result.get('bestMatch') and result['bestMatch']['confidence'] >= 0.85:
            verified.append({
                'claim': claim,
                'source_url': result['bestMatch']['detailUrl'],
                'signature': result['signature'],
            })
        else:
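            # No confident match. A fuller version would also separate out
            # refuted claims here, if the verify response flags them.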
            unverified.append(claim)

    return {'verified': verified, 'unverified': unverified}

# In your RAG flow:
response = rag_chain.invoke(query)
verification = verify_assertions(response)

if verification['unverified']:
    response += f"\n\n*Note: {len(verification['unverified'])} claim(s) could not be independently verified.*"
for v in verification['verified']:
    response += f"\n\n[Source]({v['source_url']})"

What this catches

In production deployments running this pattern alongside standard RAG, the verification layer catches roughly:

  • ~30% of fabricated-source hallucinations the retriever missed
  • ~50% of right-document-wrong-number cases
  • ~95% of date-attribution errors (model says "released July 2024" when source says "released July 2023")

The remaining gap comes from genuinely ambiguous claims (no consensus across sources) and out-of-catalog assertions. For ambiguous claims we recommend human review; for out-of-catalog assertions we recommend stricter system-prompt constraints rather than relaxing verification.
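
For the out-of-catalog case, a stricter constraint can be as simple as an extra instruction in the system prompt. A minimal sketch, assuming a generic documentation-assistant prompt; the exact wording is an illustration, not a canonical prompt:

# Illustrative system-prompt constraint for out-of-catalog assertions.
# Tune the wording against your own evaluation set.
constraint = (
    "Only state facts that appear in the retrieved context. "
    "If the context does not contain the answer, say you cannot verify it "
    "instead of guessing."
)
system_prompt = "You are a documentation assistant.\n" + constraint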

Performance

  • ~80ms p95 per verify call
  • Free tier: 1,000 verifies/month, no signup, no auth
  • Cached responses (claim → envelope) for repeated assertions
  • Parallel verification of all extracted assertions in a single async batch
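
Batching maps onto plain httpx and asyncio. A minimal sketch, assuming the same endpoint and request shape as the synchronous example above (no official async client is implied):

import asyncio
import httpx

async def verify_batch(claims: list[str], min_confidence: float = 0.85) -> list[dict]:
    # Fire all verify calls concurrently in one async batch.
    async with httpx.AsyncClient(timeout=2.0) as client:

        async def verify_one(claim: str) -> dict:
            r = await client.post(
                'https://sourcescore.org/api/v1/verify',
                json={'claim': claim, 'minConfidence': min_confidence},
            )
            return {'claim': claim, 'result': r.json()}

        return await asyncio.gather(*(verify_one(c) for c in claims))

# Usage: results = asyncio.run(verify_batch(candidates))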

Integration guides per framework

  • LangChain — retrieve-then-cite + generate-then-verify patterns (see the sketch after this list)
  • LlamaIndex — custom Retriever + NodePostprocessor
  • DSPy — verify-and-flag post-processor module
  • OpenAI tools — native function-calling pattern
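
As a taste of the LangChain pattern, the wiring below appends verification as a post-processing step. It reuses rag_chain and verify_assertions from the code above and assumes rag_chain returns a plain string; treat it as a sketch of the wiring, not the full guide:

from langchain_core.runnables import RunnableLambda

def annotate(answer: str) -> str:
    # Append verification notes and source links to the generated answer.
    verification = verify_assertions(answer)
    if verification['unverified']:
        answer += f"\n\n*Note: {len(verification['unverified'])} claim(s) could not be independently verified.*"
    for v in verification['verified']:
        answer += f"\n\n[Source]({v['source_url']})"
    return answer

verified_chain = rag_chain | RunnableLambda(annotate)
answer = verified_chain.invoke(query)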

When this fits

  • RAG over AI/ML knowledge bases (papers, model docs, technical content)
  • Documentation chatbots
  • Research-assistant pipelines
  • Any production RAG with hallucination tickets where the source data is correct

Related