III. Node kind ledger
TestSet records
Browse all TestSet records in the current atlas snapshot.
Filters & facets (3 groups)
homepageUrl
- https://github.com/idavidrein/gpqa · 2
- https://github.com/evalplus/evalplus · 1
- https://github.com/openai/grade-school-math · 1
- https://rowanzellers.com/hellaswag/ · 1
- https://livecodebench.github.io/ · 1
- https://github.com/hendrycks/math · 1
- https://www.swebench.com/original.html · 1
- https://webarena.dev/static/paper.pdf · 1
- https://openreview.net/forum?id=roNSXZpUDN · 1
- https://arxiv.org/abs/2406.15877 · 1
- https://ds1000-code-gen.github.io/ · 1
- https://openai.com/index/mle-bench/ · 1
releasedAt
description
- Canonical WebArena task artifact for autonomous web-agent evaluation. · 1
- Canonical full-set artifact for BigCodeBench code-generation evaluation. · 1
- Canonical DS-1000 artifact for data-science code-generation evaluation. · 1
- Canonical HumanEval artifact for Python code-generation evaluation. · 1
- Canonical MBPP artifact for basic Python program-synthesis evaluation. · 1
- Canonical AgentBench artifact for broad LLM-as-agent evaluation. · 1
- Canonical ToolBench evaluation artifact for API tool-use benchmarks. · 1
- Canonical AndroidWorld artifact for autonomous Android UI-control evaluation. · 1
- Canonical RE-Bench artifact for frontier AI R&D agent evaluation. · 1
- The December 2024 release of the SWE-bench Verified test set. · 1

| id | displayName | cluster |
|---|---|---|
| test-set:agentbench-environments | AgentBench multi-environment suite | benchmarks |
| test-set:agentclinic-clinical-cases | AgentClinic clinical case suite | benchmarks |
| test-set:androidworld-programmatic-tasks | AndroidWorld programmatic task suite | benchmarks |
| test-set:appworld-750-cross-app-tasks | AppWorld 750 cross-app tasks | benchmarks |
| test-set:assistantbench-214-web-tasks | AssistantBench 214 web tasks | benchmarks |
| test-set:bfcl-v3 | Berkeley Function Calling Leaderboard v3 | benchmarks |
| test-set:bigcode-evalplus | BigCode EvalPlus | benchmarks |
| test-set:bigcodebench-full | BigCodeBench full set | benchmarks |
| test-set:ds1000-full | DS-1000 full set | benchmarks |
| test-set:flores-200-devtest | FLORES-200 devtest | benchmarks |
| test-set:gaia-validation | GAIA validation split | benchmarks |
| test-set:gpqa-diamond | GPQA Diamond | benchmarks |
| test-set:gpqa-diamond-2024 | GPQA Diamond — 2024 release | benchmarks |
| test-set:gsm8k-test | GSM8K test split | benchmarks |
| test-set:hellaswag-validation | HellaSwag validation | benchmarks |
| test-set:humaneval-original | HumanEval original problem set | benchmarks |
| test-set:livecodebench-2024-12 | LiveCodeBench 2024-12 cut | benchmarks |
| test-set:math-test | MATH test split | benchmarks |
| test-set:mbpp-full | MBPP full problem set | benchmarks |
| test-set:mle-bench-competitions | MLE-bench Kaggle competition set | benchmarks |
| test-set:osworld-369-computer-tasks | OSWorld 369 real-computer tasks | benchmarks |
| test-set:re-bench-ai-rd-tasks | RE-Bench AI R&D task suite | benchmarks |
| test-set:swe-bench-original | SWE-bench original test set | benchmarks |
| test-set:swe-bench-verified-2024-12 | SWE-bench Verified 2024-12 | benchmarks |
| test-set:tau-bench-airline-retail | tau-bench airline and retail domains | benchmarks |
| test-set:terminal-bench-v1 | Terminal-Bench v1 | benchmarks |
| test-set:the-agent-company-workplace-tasks | TheAgentCompany workplace task suite | benchmarks |
| test-set:toolbench-tooleval | ToolBench ToolEval suite | benchmarks |
| test-set:travelplanner-planning-intents | TravelPlanner planning intents | benchmarks |
| test-set:truthful-qa-mc | TruthfulQA — multiple-choice | benchmarks |
| test-set:webarena-v1 | WebArena v1 task suite | benchmarks |
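
The ids in the table above all follow a `kind:slug` pattern (for example `test-set:gpqa-diamond`). As a minimal sketch, assuming the records are available as a markdown table like the one above, the rows could be split into structured records as follows. The `TestSetRecord` type and `parse_rows` helper are hypothetical names introduced for illustration; they are not part of the atlas itself.

```python
from typing import NamedTuple


class TestSetRecord(NamedTuple):
    """Hypothetical container for one row of the TestSet table."""
    kind: str          # prefix before the colon, e.g. "test-set"
    slug: str          # remainder of the id, e.g. "gpqa-diamond"
    display_name: str
    cluster: str


def parse_rows(table: str) -> list[TestSetRecord]:
    """Parse a three-column markdown table (id, displayName, cluster)."""
    records = []
    for line in table.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip malformed lines, the header row, and the |---| separator row.
        if len(cells) != 3 or cells[0] == "id" or set(cells[0]) <= {"-"}:
            continue
        kind, _, slug = cells[0].partition(":")
        records.append(TestSetRecord(kind, slug, cells[1], cells[2]))
    return records


# Two rows copied from the table above, used as sample input.
SAMPLE = """
| id | displayName | cluster |
|---|---|---|
| test-set:gpqa-diamond | GPQA Diamond | benchmarks |
| test-set:gsm8k-test | GSM8K test split | benchmarks |
"""

records = parse_rows(SAMPLE)
for record in records:
    print(record.kind, record.slug, record.display_name, sep=" | ")
```

Splitting on the first colon only keeps any later colons inside the slug, which seems the safer choice if the id scheme ever grows more segments.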