Tag: evaluation
8 verified claims carry this tag. Each has 2+ primary sources and an HMAC-SHA256 signature; a sketch of an assumed signing scheme follows the claim list.
MMLU benchmark introduced in paper: Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020).
428d754e7c651be6 · 2 sources · 100% confidence
SuperGLUE benchmark introduced in paper: SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (Wang et al., 2019).
1a1e87145608c91a · 2 sources · 100% confidence
GLUE benchmark introduced in paper: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding (Wang et al., 2018).
aa113b5e61d5c214 · 2 sources · 100% confidence
Chatbot Arena introduced in paper: Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (Chiang et al., 2024); the LMSYS platform launched in 2023 and was first described alongside MT-Bench in Zheng et al., 2023.
789ddc9bc9c3d688 · 2 sources · 100% confidence
AlpacaEval introduced in paper: AlpacaEval: An Automatic Evaluator of Instruction-following Models (Li et al., 2023), an LLM-as-judge evaluation benchmark.
2f14f3078741c0ad · 2 sources · 100% confidence
LangSmith announced by LangChain on 2023-07-18 (initially as a closed beta), an LLM observability and evaluation platform.
9ef37fbd1460c501 · 2 sources · 100% confidence
MTEB benchmark introduced in paper: MTEB: Massive Text Embedding Benchmark (Muennighoff et al., 2022).
cccd161dd058a31e · 2 sources · 100% confidence
SWE-bench introduced in paper: SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Jimenez et al., 2023; ICLR 2024), a software engineering benchmark built from real GitHub issues and pull requests.
b16b5f5297e5f621 · 2 sources · 100% confidence
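
The page does not say how the signature IDs are computed. As a minimal sketch, assume each 16-hex-character ID (e.g. 428d754e7c651be6) is a truncated HMAC-SHA256 of the claim text under a server-side secret key; the key, the text encoding, and the truncation length here are all illustrative assumptions, not documented behavior.

    import hashlib
    import hmac

    # Hypothetical server-side key; the real key material is not published.
    SECRET_KEY = b"per-tag-signing-key"

    def sign_claim(claim_text: str, key: bytes = SECRET_KEY) -> str:
        """Return the first 16 hex chars of HMAC-SHA256(key, claim_text)."""
        digest = hmac.new(key, claim_text.encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()[:16]

    def verify_claim(claim_text: str, signature: str, key: bytes = SECRET_KEY) -> bool:
        """Recompute the signature and compare in constant time."""
        return hmac.compare_digest(sign_claim(claim_text, key), signature)

    claim = ("MMLU benchmark introduced in paper: Measuring Massive Multitask "
             "Language Understanding (Hendrycks et al., 2020).")
    sig = sign_claim(claim)
    print(sig)                       # 16-hex-char ID, same shape as those above
    print(verify_claim(claim, sig))  # True

Using HMAC rather than a bare hash ties each ID to a secret key, so integrity can only be verified by the service holding the key; truncating to 64 bits keeps the IDs short at the cost of collision resistance.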