Agentic AI Atlas by a5c.ai
Agentic AI Atlas · Benchmark
65 records

Node kind ledger

Benchmark

Page 1 of 2

Benchmark records

Browse all Benchmark records in the current atlas snapshot.

Cluster · benchmarks
Total · 65
Visible · 65
Filters & facets (4 groups)

homepageUrl

https://www.swebench.com/ · 2
https://github.com/THUDM/AgentBench · 1
https://github.com/hendrycks/apps · 1
https://os-world.github.io/ · 1
https://google-research.github.io/android_world/ · 1
https://metr.org/AI_R_D_Evaluation_Report.pdf · 1
https://appworld.dev/ · 1
https://assistantbench.github.io/ · 1
https://the-agent-company.com/ · 1
https://agentclinic.github.io/ · 1
https://osu-nlp-group.github.io/TravelPlanner/ · 1
https://openai.com/index/browsecomp/ · 1

kind

model-only · 14
code-generation · 7
full-stack · 7
web-agent · 7
reasoning · 5
math · 4
tool-use · 3
domain-specific · 2
agent-leaderboard · 2
knowledge · 2
research-engineering · 1
planning · 1

description

General AI Assistants benchmark — real-world agent reasoning tasks. · 1
Hand-written programming problems for evaluating code generation. · 1
MBPP+ from EvalPlus — augmented MBPP with substantially expanded test suites. · 1
Machine learning engineering tasks drawn from Kaggle competitions. · 1
Massive Multitask Language Understanding — 57-subject knowledge benchmark. · 1
Real-world software engineering issues from open-source Python repos. · 1

targetsKind

ModelVersion · 37
AgentVersion · 28
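
The facet groups above pair each field value with the number of records that carry it. A minimal sketch of how such counts could be tallied is shown here. The `BenchmarkRecord` class, its field names, and the three sample records are hypothetical placeholders for illustration; they are not the atlas's actual schema or data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRecord:
    """Hypothetical record shape; field names are assumptions, not the atlas schema."""
    id: str
    display_name: str
    kind: str
    targets_kind: str


def facet_counts(records, field):
    """Tally how many records share each value of `field`, most common first."""
    return Counter(getattr(r, field) for r in records).most_common()


def format_facet(counts):
    """Render counts in the 'value · count' style used by the facet groups."""
    return [f"{value} · {count}" for value, count in counts]


# Placeholder records, invented purely to exercise the functions.
sample = [
    BenchmarkRecord("benchmark:example-a", "Example A", "model-only", "ModelVersion"),
    BenchmarkRecord("benchmark:example-b", "Example B", "web-agent", "AgentVersion"),
    BenchmarkRecord("benchmark:example-c", "Example C", "model-only", "ModelVersion"),
]

if __name__ == "__main__":
    for line in format_facet(facet_counts(sample, "targets_kind")):
        print(line)
```

Counting with `Counter.most_common()` orders values by descending count, which matches the ordering of the facet lists on this page; whether the atlas computes its facets this way is an assumption.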
id | displayName | cluster
benchmark:advbench | AdvBench | benchmarks
benchmark:agentbench | AgentBench | benchmarks
benchmark:agentboard | AgentBoard | benchmarks
benchmark:agentclinic | AgentClinic | benchmarks
benchmark:aider-polyglot | Aider Polyglot | benchmarks
benchmark:android-world | AndroidWorld | benchmarks
benchmark:apps | APPS | benchmarks
benchmark:appworld | AppWorld | benchmarks
benchmark:arc-agi-3 | ARC-AGI 3 | benchmarks
benchmark:arc-challenge | ARC-Challenge | benchmarks
benchmark:assistant-bench | AssistantBench | benchmarks
benchmark:bbh | BIG-Bench Hard (BBH) | benchmarks
benchmark:berkeley-function-calling | Berkeley Function Calling Leaderboard (BFCL) | benchmarks
benchmark:bias-bench | BBQ (Bias Benchmark for QA) | benchmarks
benchmark:bigcode-evalplus | EvalPlus | benchmarks
benchmark:bigcodebench | BigCodeBench | benchmarks
benchmark:browse-comp | BrowseComp | benchmarks
benchmark:cyber-bench | CyberBench | benchmarks
benchmark:ds1000 | DS-1000 | benchmarks
benchmark:fin-bench | FinBench | benchmarks
benchmark:flores-200 | FLORES-200 | benchmarks
benchmark:frontier-math | FrontierMath | benchmarks
benchmark:gaia | GAIA | benchmarks
benchmark:gpqa | GPQA | benchmarks
benchmark:gsm-symbolic | GSM-Symbolic | benchmarks
benchmark:gsm8k | GSM8K | benchmarks
benchmark:harmbench | HarmBench | benchmarks
benchmark:hellaswag | HellaSwag | benchmarks
benchmark:hle | Humanity's Last Exam (HLE) | benchmarks
benchmark:human-eval | HumanEval | benchmarks
benchmark:jailbreakbench | JailbreakBench | benchmarks
benchmark:legal-bench | LegalBench | benchmarks
benchmark:livecodebench | LiveCodeBench | benchmarks
benchmark:lmsys-arena | Chatbot Arena (LMSYS) | benchmarks
benchmark:m-mmlu | Multilingual MMLU (mMMLU) | benchmarks
benchmark:math | MATH | benchmarks
benchmark:mbpp | MBPP | benchmarks
benchmark:mbpp-plus | MBPP+ | benchmarks
benchmark:medqa | MedQA | benchmarks
benchmark:mgsm | MGSM | benchmarks
benchmark:mind2web-2 | Mind2Web 2 | benchmarks
benchmark:mle-bench | MLE-bench | benchmarks
benchmark:mmlu | MMLU | benchmarks
benchmark:mt-bench | MT-Bench | benchmarks
benchmark:multipl-e | MultiPL-E | benchmarks
benchmark:olympiad-bench | OlympiadBench | benchmarks
benchmark:os-world | OSWorld | benchmarks
benchmark:promptbench | PromptBench | benchmarks
benchmark:re-bench | RE-Bench | benchmarks
benchmark:repobench | RepoBench | benchmarks