SourceScore

Topic hub · 10 claims

Evaluation, benchmarks, and the harness problem

The benchmarks that define "capable model" — and the methodology caveats that make cross-paper comparisons unreliable. Hand-verified primary sources for every benchmark cited in the literature.

Why benchmarks matter — and why they mislead

Benchmarks are how the field measures progress. MMLU, HumanEval, GLUE, SuperGLUE, Chatbot Arena — each tries to capture a different dimension of capability (knowledge breadth, code generation, language understanding, conversational quality). But the same benchmark name can produce different scores under different evaluation harnesses, prompt formats, and decoding strategies, which is exactly why VERITAS does not ship performance-comparison claims (see /blog/why-no-performance-claims/).

The classics

GLUE (Wang et al. 2018) and SuperGLUE (Wang et al. 2019) were the first widely adopted multi-task natural-language-understanding benchmarks. ImageNet (Deng et al., CVPR 2009) preceded them in vision. BLEU (Papineni et al., ACL 2002) and ROUGE (Lin, ACL 2004) measured machine-translation and summarization quality, respectively. These benchmarks shaped a decade of progress.

The LLM-era benchmarks

MMLU (Hendrycks et al. 2021) tests knowledge breadth across 57 subjects. HumanEval (Chen et al., OpenAI 2021) tests code generation. AlpacaEval (Tatsu Lab 2023) uses LLM-as-judge. Chatbot Arena (LMSYS 2023) uses pairwise human preferences. Each adds methodological subtlety: which split? which prompt? few-shot or zero-shot? chain-of-thought? The right reading is: track benchmarks as trend signals, not absolute rankings.
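A minimal sketch of why those methodological choices matter: the same multiple-choice item rendered under three common prompt regimes yields three different inputs to the model, and therefore potentially three different scores. The helper names and exemplar below are hypothetical for illustration; real harnesses each hard-code their own variant of this formatting.

```python
# Sketch: one benchmark item under three prompt regimes (hypothetical helpers).

QUESTION = "Which gas makes up most of Earth's atmosphere?"
CHOICES = ["Oxygen", "Nitrogen", "Carbon dioxide", "Argon"]


def format_zero_shot(question, choices):
    """Bare question: the model must infer the expected answer format."""
    opts = "\n".join(f"{letter}. {c}" for letter, c in zip("ABCD", choices))
    return f"{question}\n{opts}\nAnswer:"


def format_few_shot(question, choices, exemplars):
    """Prepend worked examples; how many, and their wording, are harness choices."""
    shots = "\n\n".join(exemplars)
    return f"{shots}\n\n{format_zero_shot(question, choices)}"


def format_chain_of_thought(question, choices):
    """Elicit reasoning first; scoring then also needs an answer extractor."""
    return f"{format_zero_shot(question, choices)} Let's think step by step."


exemplar = "What is 2 + 2?\nA. 3\nB. 4\nC. 5\nD. 6\nAnswer: B"

prompts = {
    "zero-shot": format_zero_shot(QUESTION, CHOICES),
    "few-shot": format_few_shot(QUESTION, CHOICES, [exemplar]),
    "cot": format_chain_of_thought(QUESTION, CHOICES),
}

# Three distinct prompts for the "same" benchmark item — before decoding
# temperature or answer extraction even enter the picture.
assert len(set(prompts.values())) == 3
```

This is why a reported "MMLU score" is only comparable to another one produced by the same harness with the same settings.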

Defined terms (3)

Benchmark
A standardized dataset and evaluation protocol designed to measure a specific capability across multiple models.
Evaluation harness
Software that runs an LLM through a benchmark in a reproducible way. Different harnesses (e.g., EleutherAI's LM Evaluation Harness, Stanford's HELM) can produce different scores for the same nominal benchmark.
LLM-as-judge
Evaluation approach where one LLM scores the outputs of another. Used by AlpacaEval and MT-Bench. Cheaper than human evaluation; biased toward judge-model preferences.
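One standard mitigation for the judge bias noted above is to ask for each pairwise verdict twice, swapping the positions of the two responses, and only count verdicts that agree. This is a hedged sketch, not AlpacaEval's or MT-Bench's actual prompt; `call_judge` stands in for a model API call returning "A" or "B".

```python
# Sketch: pairwise LLM-as-judge with a position swap to detect order bias.
# `call_judge` is a hypothetical stand-in for a judge-model API call.

JUDGE_TEMPLATE = (
    "Which response better answers the question?\n"
    "Question: {q}\nResponse A: {a}\nResponse B: {b}\n"
    "Reply with exactly 'A' or 'B'."
)


def pairwise_verdict(call_judge, question, out_1, out_2):
    """Judge both orderings; disagreement between them signals order bias."""
    first = call_judge(JUDGE_TEMPLATE.format(q=question, a=out_1, b=out_2))
    # Swap positions and re-ask, then map the letter back to the same model.
    swapped = call_judge(JUDGE_TEMPLATE.format(q=question, a=out_2, b=out_1))
    second = {"A": "B", "B": "A"}[swapped]
    if first == second:
        return first   # consistent verdict across both orderings
    return "tie"       # orderings disagree: treat as a tie


# A purely position-biased toy judge always picks the first slot;
# the swap exposes it, and the verdict collapses to a tie.
position_biased_judge = lambda prompt: "A"
assert pairwise_verdict(position_biased_judge, "Q?", "x", "y") == "tie"
```

The swap doubles judge cost, which is one reason cheaper single-pass setups persist despite the known bias.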

All claims in this topic (10)

Related

Framework integrations