SourceScore

Verified claim · AI-ML · 100% confidence

The Pile dataset released on: 2020-12-31.

Last verified 2026-05-16 · Methodology veritas-v0.1 · 4aef1422b96df26c

Structured fields

Subject: The Pile dataset
Predicate: released_on
Object: 2020-12-31
Confidence: 100%
Tags: the-pile · dataset · pretraining · eleutherai · 2020

Sources (2)

  1. [1] preprint · arXiv (Gao, Biderman, Black, Golding, Hoppe, Foster, Phang, He, Thite, Nabeshima, Presser, Leahy) · 2020-12-31

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling
    In this work, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models.
  2. [2] official blog · EleutherAI

    The Pile — official site

Cite this claim

Ready-to-paste citation (Markdown / plain text):

The Pile dataset released on: 2020-12-31. — SourceScore Claim 4aef1422b96df26c (verified 2026-05-16). https://sourcescore.org/api/v1/claims/4aef1422b96df26c.json

Embed this claim

Drop this iframe into any blog post, docs page, or knowledge base. The widget renders the signed claim, its primary source, and a link back to this canonical page. Licensed under CC-BY 4.0; attribution is included.

<iframe src="https://sourcescore.org/embed/claim/4aef1422b96df26c/" width="100%" height="360" frameborder="0" loading="lazy" title="The Pile dataset released on: 2020-12-31."></iframe>

Use this claim in your code

Fetch this signed envelope from your application. The response includes the verbatim excerpt, primary source URLs, and an HMAC-SHA256 signature you can verify locally for audit trails.

cURL

curl https://sourcescore.org/api/v1/claims/4aef1422b96df26c.json

JavaScript / TypeScript

const r = await fetch("https://sourcescore.org/api/v1/claims/4aef1422b96df26c.json");
const envelope = await r.json();
console.log(envelope.claim.statement); // "The Pile dataset released on: 2020-12-31."

Python

import httpx

r = httpx.get("https://sourcescore.org/api/v1/claims/4aef1422b96df26c.json")
envelope = r.json()
print(envelope["claim"]["statement"])  # "The Pile dataset released on: 2020-12-31."
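The HMAC-SHA256 signature mentioned in the introduction to this section can be checked locally with Python's standard library. The sketch below is illustrative only: this page does not document which bytes are signed, which envelope field carries the signature, or how the shared key is obtained. The canonical serialization (sorted keys, compact separators), the hex-encoded signature, and the names `canonical_bytes`, `verify_signature`, and `demo_key` are all assumptions made for the example.

```python
import hashlib
import hmac
import json


def canonical_bytes(claim: dict) -> bytes:
    """Serialize a claim deterministically.

    ASSUMPTION: sorted keys and compact separators. The real signing
    input used by the API is not documented on this page.
    """
    return json.dumps(claim, sort_keys=True, separators=(",", ":")).encode("utf-8")


def verify_signature(claim: dict, signature_hex: str, key: bytes) -> bool:
    """Return True if signature_hex is the HMAC-SHA256 of the claim under key.

    ASSUMPTION: the signature is hex-encoded and the verifier holds the
    same secret key as the signer (HMAC is a symmetric scheme).
    """
    expected = hmac.new(key, canonical_bytes(claim), hashlib.sha256).hexdigest()
    # compare_digest runs in constant time to avoid leaking timing information.
    return hmac.compare_digest(expected, signature_hex.lower())


if __name__ == "__main__":
    # Hypothetical key and claim, used only to exercise the functions.
    demo_key = b"hypothetical-shared-secret"
    claim = {"statement": "The Pile dataset released on: 2020-12-31."}
    signature = hmac.new(demo_key, canonical_bytes(claim), hashlib.sha256).hexdigest()
    print(verify_signature(claim, signature, demo_key))
```

Because HMAC is symmetric, a check like this is only possible for a party that holds the signing key; adapt the serialization and field names to whatever the API actually specifies before relying on the result.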

LangChain (retrieve-then-cite)

from langchain_core.tools import tool
import httpx

@tool
def get_the_pile_dataset_fact() -> dict:
    """Fetch the verified SourceScore claim for The Pile dataset."""
    r = httpx.get("https://sourcescore.org/api/v1/claims/4aef1422b96df26c.json")
    return r.json()