III. Node kind ledger
Benchmark records (page 1 of 2)
Browse all Benchmark records in the current atlas snapshot.
Filters & facets (4 groups)
homepageUrl
https://www.swebench.com/ · 2
https://github.com/THUDM/AgentBench · 1
https://github.com/hendrycks/apps · 1
https://os-world.github.io/ · 1
https://google-research.github.io/android_world/ · 1
https://metr.org/AI_R_D_Evaluation_Report.pdf · 1
https://appworld.dev/ · 1
https://assistantbench.github.io/ · 1
https://the-agent-company.com/ · 1
https://agentclinic.github.io/ · 1
https://osu-nlp-group.github.io/TravelPlanner/ · 1
https://openai.com/index/browsecomp/ · 1
kind
description
General AI Assistants benchmark: real-world agent reasoning tasks. · 1
Hand-written programming problems for evaluating code generation. · 1
MBPP+ from EvalPlus: augmented MBPP with substantially expanded test suites. · 1
Machine learning engineering tasks drawn from Kaggle competitions. · 1
Massive Multitask Language Understanding: 57-subject knowledge benchmark. · 1
Real-world software engineering issues from open-source Python repos. · 1
targetsKind
| id | displayName | cluster |
|---|---|---|
| benchmark:advbench | AdvBench | benchmarks |
| benchmark:agentbench | AgentBench | benchmarks |
| benchmark:agentboard | AgentBoard | benchmarks |
| benchmark:agentclinic | AgentClinic | benchmarks |
| benchmark:aider-polyglot | Aider Polyglot | benchmarks |
| benchmark:android-world | AndroidWorld | benchmarks |
| benchmark:apps | APPS | benchmarks |
| benchmark:appworld | AppWorld | benchmarks |
| benchmark:arc-agi-3 | ARC-AGI 3 | benchmarks |
| benchmark:arc-challenge | ARC-Challenge | benchmarks |
| benchmark:assistant-bench | AssistantBench | benchmarks |
| benchmark:bbh | BIG-Bench Hard (BBH) | benchmarks |
| benchmark:berkeley-function-calling | Berkeley Function Calling Leaderboard (BFCL) | benchmarks |
| benchmark:bias-bench | BBQ (Bias Benchmark for QA) | benchmarks |
| benchmark:bigcode-evalplus | EvalPlus | benchmarks |
| benchmark:bigcodebench | BigCodeBench | benchmarks |
| benchmark:browse-comp | BrowseComp | benchmarks |
| benchmark:cyber-bench | CyberBench | benchmarks |
| benchmark:ds1000 | DS-1000 | benchmarks |
| benchmark:fin-bench | FinBench | benchmarks |
| benchmark:flores-200 | FLORES-200 | benchmarks |
| benchmark:frontier-math | FrontierMath | benchmarks |
| benchmark:gaia | GAIA | benchmarks |
| benchmark:gpqa | GPQA | benchmarks |
| benchmark:gsm-symbolic | GSM-Symbolic | benchmarks |
| benchmark:gsm8k | GSM8K | benchmarks |
| benchmark:harmbench | HarmBench | benchmarks |
| benchmark:hellaswag | HellaSwag | benchmarks |
| benchmark:hle | Humanity's Last Exam (HLE) | benchmarks |
| benchmark:human-eval | HumanEval | benchmarks |
| benchmark:jailbreakbench | JailbreakBench | benchmarks |
| benchmark:legal-bench | LegalBench | benchmarks |
| benchmark:livecodebench | LiveCodeBench | benchmarks |
| benchmark:lmsys-arena | Chatbot Arena (LMSYS) | benchmarks |
| benchmark:m-mmlu | Multilingual MMLU (mMMLU) | benchmarks |
| benchmark:math | MATH | benchmarks |
| benchmark:mbpp | MBPP | benchmarks |
| benchmark:mbpp-plus | MBPP+ | benchmarks |
| benchmark:medqa | MedQA | benchmarks |
| benchmark:mgsm | MGSM | benchmarks |
| benchmark:mind2web-2 | Mind2Web 2 | benchmarks |
| benchmark:mle-bench | MLE-bench | benchmarks |
| benchmark:mmlu | MMLU | benchmarks |
| benchmark:mt-bench | MT-Bench | benchmarks |
| benchmark:multipl-e | MultiPL-E | benchmarks |
| benchmark:olympiad-bench | OlympiadBench | benchmarks |
| benchmark:os-world | OSWorld | benchmarks |
| benchmark:promptbench | PromptBench | benchmarks |
| benchmark:re-bench | RE-Bench | benchmarks |
| benchmark:repobench | RepoBench | benchmarks |