Tag
benchmark
7 verified claims carrying this tag. Each has 2+ primary sources and an HMAC-SHA256 signature.
MMLU benchmark introduced in paper: Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020).
428d754e7c651be6 · 2 sources · 100% confidence
HumanEval benchmark introduced in paper: Evaluating Large Language Models Trained on Code (Chen et al., 2021).
71ec42731d2c9e0c · 2 sources · 100% confidence
SuperGLUE benchmark introduced in paper: SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (Wang et al., 2019).
1a1e87145608c91a · 2 sources · 100% confidence
GLUE benchmark introduced in paper: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding (Wang et al., 2018).
aa113b5e61d5c214 · 2 sources · 100% confidence
MTEB benchmark introduced in paper: MTEB: Massive Text Embedding Benchmark (Muennighoff et al., 2022).
cccd161dd058a31e · 2 sources · 100% confidence
ARC-AGI benchmark introduced in paper: On the Measure of Intelligence (Chollet, 2019), as the Abstraction and Reasoning Corpus (ARC).
cc5df3c14d35fa49 · 2 sources · 100% confidence
SWE-bench benchmark introduced in paper: SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Jimenez et al., 2024), built from real GitHub issues.
b16b5f5297e5f621 · 2 sources · 100% confidence
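The per-claim HMAC-SHA256 signatures could be computed and verified along the following lines. This is a sketch only: the signing key, the canonical claim text used as the message, and the truncation of the hex digest to 16 characters (to match the IDs shown above) are all assumptions, not the registry's confirmed scheme.

```python
import hmac
import hashlib

def sign_claim(secret_key: bytes, claim_text: str) -> str:
    # Hypothetical scheme: HMAC-SHA256 over the claim text,
    # hex digest truncated to 16 characters to match the IDs above.
    digest = hmac.new(secret_key, claim_text.encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return digest[:16]

def verify_claim(secret_key: bytes, claim_text: str, signature: str) -> bool:
    # Constant-time comparison to avoid timing side channels.
    expected = sign_claim(secret_key, claim_text)
    return hmac.compare_digest(expected, signature)

# Example with a made-up key and claim text:
sig = sign_claim(b"example-key", "MMLU benchmark introduced in paper: ...")
print(verify_claim(b"example-key",
                   "MMLU benchmark introduced in paper: ...", sig))  # True
```

Truncating the digest trades collision resistance for compact display; verification still recomputes the full HMAC before truncating, so a valid key is required to forge a matching signature.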