III.
Node kind ledger
TestSet records
Browse all TestSet records in the current atlas snapshot.
Filters & facets: 1 active · 3 groups
homepageUrl
https://github.com/idavidrein/gpqa · 2
https://github.com/evalplus/evalplus · 1
https://github.com/openai/grade-school-math · 1
https://rowanzellers.com/hellaswag/ · 1
https://livecodebench.github.io/ · 1
https://github.com/hendrycks/math · 1
https://www.swebench.com/original.html · 1
https://webarena.dev/static/paper.pdf · 1
https://openreview.net/forum?id=roNSXZpUDN · 1
https://arxiv.org/abs/2406.15877 · 1
https://ds1000-code-gen.github.io/ · 1
https://openai.com/index/mle-bench/ · 1
releasedAt
description
Canonical WebArena task artifact for autonomous web-agent evaluation. · 1
Canonical full-set artifact for BigCodeBench code-generation evaluation. · 1
Canonical DS-1000 artifact for data-science code-generation evaluation. · 1
Canonical HumanEval artifact for Python code-generation evaluation. · 1
Canonical MBPP artifact for basic Python program-synthesis evaluation. · 1
Canonical AgentBench artifact for broad LLM-as-agent evaluation. · 1
Canonical ToolBench evaluation artifact for API tool-use benchmarks. · 1
Canonical AndroidWorld artifact for autonomous Android UI-control evaluation. · 1
Canonical RE-Bench artifact for frontier AI R&D agent evaluation. · 1
The December 2024 release of the SWE-bench Verified test set. · 1
| id | displayName | cluster |
|---|---|---|
| test-set:math-test | MATH test split | benchmarks |