Agentic AI Atlas by a5c.ai
Agentic AI Atlas · TestSet
31 records · a5c.ai

Node kind ledger

TestSet

TestSet records

Browse all TestSet records in the current atlas snapshot.

Cluster · benchmarks
Total · 31
Visible · 31

Filters & facets (3 groups)

homepageUrl

https://github.com/idavidrein/gpqa · 2
https://github.com/evalplus/evalplus · 1
https://github.com/openai/grade-school-math · 1
https://rowanzellers.com/hellaswag/ · 1
https://livecodebench.github.io/ · 1
https://github.com/hendrycks/math · 1
https://www.swebench.com/original.html · 1
https://webarena.dev/static/paper.pdf · 1
https://openreview.net/forum?id=roNSXZpUDN · 1
https://arxiv.org/abs/2406.15877 · 1
https://ds1000-code-gen.github.io/ · 1
https://openai.com/index/mle-bench/ · 1

releasedAt

2023-11-29 · 2
2024-12-01 · 2
2024-06-17 · 2
2023-05-08 · 1
2021-10-27 · 1
2019-05-19 · 1
2021-03-05 · 1
2023-10-10 · 1
2023-07-26 · 1
2022-11-21 · 1
2024-10-09 · 1
2021-07-07 · 1

description

Canonical WebArena task artifact for autonomous web-agent evaluation. · 1
Canonical full-set artifact for BigCodeBench code-generation evaluation. · 1
Canonical DS-1000 artifact for data-science code-generation evaluation. · 1
Canonical HumanEval artifact for Python code-generation evaluation. · 1
Canonical MBPP artifact for basic Python program-synthesis evaluation. · 1
Canonical AgentBench artifact for broad LLM-as-agent evaluation. · 1
Canonical ToolBench evaluation artifact for API tool-use benchmarks. · 1
Canonical AndroidWorld artifact for autonomous Android UI-control evaluation. · 1
Canonical RE-Bench artifact for frontier AI R&D agent evaluation. · 1
The December 2024 release of the SWE-bench Verified test set. · 1
id | displayName | cluster
test-set:agentbench-environments | AgentBench multi-environment suite | benchmarks
test-set:agentclinic-clinical-cases | AgentClinic clinical case suite | benchmarks
test-set:androidworld-programmatic-tasks | AndroidWorld programmatic task suite | benchmarks
test-set:appworld-750-cross-app-tasks | AppWorld 750 cross-app tasks | benchmarks
test-set:assistantbench-214-web-tasks | AssistantBench 214 web tasks | benchmarks
test-set:bfcl-v3 | Berkeley Function Calling Leaderboard v3 | benchmarks
test-set:bigcode-evalplus | BigCode EvalPlus | benchmarks
test-set:bigcodebench-full | BigCodeBench full set | benchmarks
test-set:ds1000-full | DS-1000 full set | benchmarks
test-set:flores-200-devtest | FLORES-200 devtest | benchmarks
test-set:gaia-validation | GAIA validation split | benchmarks
test-set:gpqa-diamond | GPQA Diamond | benchmarks
test-set:gpqa-diamond-2024 | GPQA Diamond — 2024 release | benchmarks
test-set:gsm8k-test | GSM8K test split | benchmarks
test-set:hellaswag-validation | HellaSwag validation | benchmarks
test-set:humaneval-original | HumanEval original problem set | benchmarks
test-set:livecodebench-2024-12 | LiveCodeBench 2024-12 cut | benchmarks
test-set:math-test | MATH test split | benchmarks
test-set:mbpp-full | MBPP full problem set | benchmarks
test-set:mle-bench-competitions | MLE-bench Kaggle competition set | benchmarks
test-set:osworld-369-computer-tasks | OSWorld 369 real-computer tasks | benchmarks
test-set:re-bench-ai-rd-tasks | RE-Bench AI R&D task suite | benchmarks
test-set:swe-bench-original | SWE-bench original test set | benchmarks
test-set:swe-bench-verified-2024-12 | SWE-bench Verified 2024-12 | benchmarks
test-set:tau-bench-airline-retail | tau-bench airline and retail domains | benchmarks
test-set:terminal-bench-v1 | Terminal-Bench v1 | benchmarks
test-set:the-agent-company-workplace-tasks | TheAgentCompany workplace task suite | benchmarks
test-set:toolbench-tooleval | ToolBench ToolEval suite | benchmarks
test-set:travelplanner-planning-intents | TravelPlanner planning intents | benchmarks
test-set:truthful-qa-mc | TruthfulQA — multiple-choice | benchmarks
test-set:webarena-v1 | WebArena v1 task suite | benchmarks
