Agentic AI Atlas

Agentic AI Atlas by a5c.ai

Agentic AI Atlas · TestSet

1 record

III. Node kind ledger

TestSet

TestSet records

Browse all TestSet records in the current atlas snapshot.

Cluster · benchmarks | Total · 31 | Visible · 1

Active filter: homepageUrl: https://github.com/evalplus/evalplus

Filters & facets · 1 active · 3 groups

homepageUrl

https://github.com/idavidrein/gpqa · 2
https://github.com/evalplus/evalplus · 1
https://github.com/openai/grade-school-math · 1
https://rowanzellers.com/hellaswag/ · 1
https://livecodebench.github.io/ · 1
https://github.com/hendrycks/math · 1
https://www.swebench.com/original.html · 1
https://webarena.dev/static/paper.pdf · 1
https://openreview.net/forum?id=roNSXZpUDN · 1
https://arxiv.org/abs/2406.15877 · 1
https://ds1000-code-gen.github.io/ · 1
https://openai.com/index/mle-bench/ · 1

releasedAt

2023-11-29 · 2
2024-12-01 · 2
2024-06-17 · 2
2023-05-08 · 1
2021-10-27 · 1
2019-05-19 · 1
2021-03-05 · 1
2023-10-10 · 1
2023-07-26 · 1
2022-11-21 · 1
2024-10-09 · 1
2021-07-07 · 1

description

Canonical WebArena task artifact for autonomous web-agent evaluation. · 1
Canonical full-set artifact for BigCodeBench code-generation evaluation. · 1
Canonical DS-1000 artifact for data-science code-generation evaluation. · 1
Canonical HumanEval artifact for Python code-generation evaluation. · 1
Canonical MBPP artifact for basic Python program-synthesis evaluation. · 1
Canonical AgentBench artifact for broad LLM-as-agent evaluation. · 1
Canonical ToolBench evaluation artifact for API tool-use benchmarks. · 1
Canonical AndroidWorld artifact for autonomous Android UI-control evaluation. · 1
Canonical RE-Bench artifact for frontier AI R&D agent evaluation. · 1
The December 2024 release of the SWE-bench Verified test set. · 1

id	displayName	cluster
test-set:bigcode-evalplus	BigCode EvalPlus	benchmarks