Agentic AI Atlas by a5c.ai
Agentic AI Atlas · TestSet
1 record

Node kind ledger

TestSet

Page 1 of 1

TestSet records

Browse all TestSet records in the current atlas snapshot.

Cluster · benchmarks
Total · 31
Visible · 1
Filters & facets: 1 active · 3 groups

homepageUrl

https://github.com/idavidrein/gpqa · 2
https://github.com/evalplus/evalplus · 1
https://github.com/openai/grade-school-math · 1
https://rowanzellers.com/hellaswag/ · 1
https://livecodebench.github.io/ · 1
https://github.com/hendrycks/math · 1
https://www.swebench.com/original.html · 1
https://webarena.dev/static/paper.pdf · 1
https://openreview.net/forum?id=roNSXZpUDN · 1
https://arxiv.org/abs/2406.15877 · 1
https://ds1000-code-gen.github.io/ · 1
https://openai.com/index/mle-bench/ · 1

releasedAt

2023-11-29 · 2
2024-12-01 · 2
2024-06-17 · 2
2023-05-08 · 1
2021-10-27 · 1
2019-05-19 · 1
2021-03-05 · 1
2023-10-10 · 1
2023-07-26 · 1
2022-11-21 · 1
2024-10-09 · 1
2021-07-07 · 1

description

Canonical WebArena task artifact for autonomous web-agent evaluation. · 1
Canonical full-set artifact for BigCodeBench code-generation evaluation. · 1
Canonical DS-1000 artifact for data-science code-generation evaluation. · 1
Canonical HumanEval artifact for Python code-generation evaluation. · 1
Canonical MBPP artifact for basic Python program-synthesis evaluation. · 1
Canonical AgentBench artifact for broad LLM-as-agent evaluation. · 1
Canonical ToolBench evaluation artifact for API tool-use benchmarks. · 1
Canonical AndroidWorld artifact for autonomous Android UI-control evaluation. · 1
Canonical RE-Bench artifact for frontier AI R&D agent evaluation. · 1
The December 2024 release of the SWE-bench Verified test set. · 1
id                          displayName        cluster
test-set:bigcode-evalplus   BigCode EvalPlus   benchmarks

Active filters

homepageUrl: https://github.com/evalplus/evalplus

Sort

id-asc
id-desc
name-asc
name-desc