Agentic AI Atlas by a5c.ai
Agentic AI Atlas · for_benchmark

for_benchmark

for_benchmark ledger

An EvalRun is for a particular Benchmark (alias of evaluated_by).

Pairs · 67
Cardinality · N:1

from | to | to kind
eval-run:mmlu.qwen-2-5-72b.2024-09 | benchmark:mmlu | Benchmark
eval-run:human-eval.qwen-2-5-72b.2024-09 | benchmark:human-eval | Benchmark
eval-run:human-eval.qwen-2-5-coder-32b.2024-11 | benchmark:human-eval | Benchmark
eval-run:livecodebench.qwen-2-5-coder-32b.2024-11 | benchmark:livecodebench | Benchmark
eval-run:mbpp.qwen-2-5-coder-32b.2024-11 | benchmark:mbpp | Benchmark
eval-run:swe-bench-verified.claude-haiku-4-5.2025-10 | benchmark:swe-bench-verified | Benchmark
eval-run:gpqa.claude-haiku-4-5.2025-10 | benchmark:gpqa | Benchmark
eval-run:human-eval.claude-sonnet-4-6.2025-11 | benchmark:human-eval | Benchmark
eval-run:mmlu.claude-sonnet-4-6.2025-11 | benchmark:mmlu | Benchmark
eval-run:bfcl.claude-sonnet-4-5.2025-09 | benchmark:berkeley-function-calling | Benchmark
eval-run:gpqa-diamond.claude-opus-4-5.2025-09 | benchmark:gpqa | Benchmark
eval-run:os-world.claude-sonnet-4-5.2025-09 | benchmark:os-world | Benchmark
eval-run:truthful-qa.claude-opus-4-5.2025-09 | benchmark:truthful-qa | Benchmark
eval-run:human-eval-plus.claude-sonnet-4-5.2025-09 | benchmark:bigcode-evalplus | Benchmark
eval-run:harmbench.claude-opus-4-5.2025-09 | benchmark:harmbench | Benchmark
eval-run:arc-challenge.claude-sonnet-4-5.2025-09 | benchmark:arc-challenge | Benchmark
eval-run:mmlu.deepseek-v3.2024-12 | benchmark:mmlu | Benchmark
eval-run:human-eval.deepseek-v3.2024-12 | benchmark:human-eval | Benchmark
eval-run:swe-bench.deepseek-v3.2024-12 | benchmark:swe-bench-verified | Benchmark
eval-run:mmlu.deepseek-r1.2025-01 | benchmark:mmlu | Benchmark
eval-run:math.deepseek-r1.2025-01 | benchmark:math | Benchmark
eval-run:gpqa.deepseek-r1.2025-01 | benchmark:gpqa | Benchmark
eval-run:gpqa.gemini-2-5-pro.2025-06 | benchmark:gpqa | Benchmark
eval-run:livecodebench.gemini-2-5-pro.2025-06 | benchmark:livecodebench | Benchmark
eval-run:swe-bench-verified.gemini-2-5-flash.2025-06 | benchmark:swe-bench-verified | Benchmark
eval-run:gpqa-diamond.gemini-2-5-pro.2025-06 | benchmark:gpqa | Benchmark
eval-run:android-world.gemini-2-5-pro.2025-06 | benchmark:android-world | Benchmark
eval-run:mgsm.gemini-2-5-pro.2025-06 | benchmark:mgsm | Benchmark
eval-run:gpqa-diamond.gemini-3-1-pro.2026-02-19 | benchmark:gpqa | Benchmark
eval-run:gpqa-diamond.gemini-3-pro.2025-11-18 | benchmark:gpqa | Benchmark
eval-run:swe-bench.llama-3-1-405b.2024-07 | benchmark:swe-bench-verified | Benchmark
eval-run:mmlu.llama-3-1-405b.2024-07 | benchmark:mmlu | Benchmark
eval-run:human-eval.llama-3-1-405b.2024-07 | benchmark:human-eval | Benchmark
eval-run:mmlu.llama-3-3-70b.2024-12 | benchmark:mmlu | Benchmark
eval-run:human-eval.llama-3-3-70b.2024-12 | benchmark:human-eval | Benchmark
eval-run:mmlu.mistral-large-2.2024-07 | benchmark:mmlu | Benchmark
eval-run:human-eval.mistral-large-2.2024-07 | benchmark:human-eval | Benchmark
eval-run:human-eval.codestral-25-01.2025-01 | benchmark:human-eval | Benchmark
eval-run:multipl-e.codestral-25-01.2025-01 | benchmark:multipl-e | Benchmark
eval-run:gpqa.gpt-5.2025-08 | benchmark:gpqa | Benchmark
eval-run:human-eval.gpt-5.2025-08 | benchmark:human-eval | Benchmark
eval-run:mmlu.o1.2024-12 | benchmark:mmlu | Benchmark
eval-run:math.o3.2025-04 | benchmark:math | Benchmark
eval-run:bfcl.gpt-5.2025-08 | benchmark:berkeley-function-calling | Benchmark
eval-run:gpqa-diamond.gpt-5.2025-08 | benchmark:gpqa | Benchmark
eval-run:human-eval-plus.gpt-5.2025-08 | benchmark:bigcode-evalplus | Benchmark
eval-run:gpqa-diamond.gpt-5-4.2026-03-17 | benchmark:gpqa | Benchmark
eval-run:gpqa-diamond.gpt-5-4-mini.2026-03-17 | benchmark:gpqa | Benchmark
eval-run:mmlu.phi-3-medium.2024-05 | benchmark:mmlu | Benchmark
eval-run:mmlu.gemma-2-27b.2024-06 | benchmark:mmlu | Benchmark
eval-run:gsm8k.gemma-2-27b.2024-06 | benchmark:gsm8k | Benchmark
eval-run:mmlu.command-r-plus.2024-08 | benchmark:mmlu | Benchmark
eval-run:swe-bench-verified.claude-opus-4-5.2025-09 | benchmark:swe-bench-verified | Benchmark
eval-run:swe-bench-verified.claude-opus-4-7.2026-01 | benchmark:swe-bench-verified | Benchmark
eval-run:gpqa.claude-sonnet-4-5.2025-09 | benchmark:gpqa | Benchmark
eval-run:livecodebench.gpt-5.2025-08 | benchmark:livecodebench | Benchmark
eval-run:swe-bench-verified.o3.2025-04 | benchmark:swe-bench-verified | Benchmark
eval-run:swe-bench-verified.gemini-2-5-pro.2025-06 | benchmark:swe-bench-verified | Benchmark
eval-run:gsm8k.claude-sonnet-4-5.2025-09 | benchmark:gsm8k | Benchmark
eval-run:hellaswag.claude-opus-4-5.2025-09 | benchmark:hellaswag | Benchmark
eval-run:math.gpt-5.2025-08 | benchmark:math | Benchmark
eval-run:evalplus.gpt-5.2025-08 | benchmark:bigcode-evalplus | Benchmark
eval-run:terminal-bench.claude-sonnet-4-5.2025-09 | benchmark:terminal-bench | Benchmark
eval-run:gaia.claude-code.2025 | benchmark:gaia | Benchmark
eval-run:swe-bench.claude-code@1.x.2025-04-29 | benchmark:swe-bench-verified | Benchmark
eval-run:swe-bench-verified.claude-sonnet-4-5.2025-09 | benchmark:swe-bench-verified | Benchmark
eval-run:swe-bench-verified.gpt-5.2025-08 | benchmark:swe-bench-verified | Benchmark
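The ledger rows can be grouped by target to see how many runs point at each benchmark. The sketch below is illustrative only: the ID layout `eval-run:<slug>.<subject>.<date>` is inferred from the rows above rather than documented, and the helper names `split_run_id` and `runs_per_benchmark` are hypothetical, not part of any Atlas API. It uses a handful of pairs copied from the ledger.

```python
from collections import Counter

# A small sample of (from, to) pairs copied from the ledger.
PAIRS = [
    ("eval-run:mmlu.qwen-2-5-72b.2024-09", "benchmark:mmlu"),
    ("eval-run:gpqa-diamond.claude-opus-4-5.2025-09", "benchmark:gpqa"),
    ("eval-run:gpqa.deepseek-r1.2025-01", "benchmark:gpqa"),
    ("eval-run:bfcl.gpt-5.2025-08", "benchmark:berkeley-function-calling"),
    ("eval-run:swe-bench.claude-code@1.x.2025-04-29", "benchmark:swe-bench-verified"),
]


def split_run_id(run_id):
    """Split an eval-run ID into (slug, subject, date).

    Assumed layout: eval-run:<slug>.<subject>.<date>. The subject may itself
    contain dots (e.g. "claude-code@1.x"), so the slug is taken before the
    first dot and the date after the last dot.
    """
    body = run_id.split(":", 1)[1]
    slug, rest = body.split(".", 1)
    subject, date = rest.rsplit(".", 1)
    return slug, subject, date


def runs_per_benchmark(pairs):
    """Count how many eval runs target each benchmark."""
    return Counter(target for _, target in pairs)


if __name__ == "__main__":
    for target, count in runs_per_benchmark(PAIRS).most_common():
        print(target, count)
```

Note that the slug in a run ID does not always match the target benchmark ID (for example `bfcl` maps to `benchmark:berkeley-function-calling` and `gpqa-diamond` to `benchmark:gpqa`), so grouping should rely on the `to` column rather than on the parsed slug.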

Definition

Source · EvalRun

Target · Benchmark

Cardinality · N:1
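The N:1 cardinality means many EvalRun nodes may share one Benchmark, but each EvalRun should point at exactly one. A minimal sketch of how that constraint could be checked over a list of pairs follows; the function name `find_cardinality_violations` and the ID-prefix checks are assumptions made for illustration, not a documented Atlas validator.

```python
def find_cardinality_violations(pairs):
    """Return problems that would break the for_benchmark definition.

    Checks, under the assumptions stated above:
      * every source ID starts with "eval-run:" (Source · EvalRun)
      * every target ID starts with "benchmark:" (Target · Benchmark)
      * no source maps to more than one distinct target (N:1)
    """
    problems = []
    targets_by_source = {}
    for source, target in pairs:
        if not source.startswith("eval-run:"):
            problems.append(("bad_source", source))
        if not target.startswith("benchmark:"):
            problems.append(("bad_target", target))
        targets_by_source.setdefault(source, set()).add(target)
    for source, targets in targets_by_source.items():
        if len(targets) > 1:
            problems.append(("multiple_targets", source))
    return problems


# Two runs sharing one benchmark is fine under N:1.
valid = [
    ("eval-run:gpqa.gpt-5.2025-08", "benchmark:gpqa"),
    ("eval-run:gpqa-diamond.gpt-5.2025-08", "benchmark:gpqa"),
]

# One run pointing at two benchmarks would violate N:1 (made-up second row).
invalid = valid + [("eval-run:gpqa.gpt-5.2025-08", "benchmark:mmlu")]

if __name__ == "__main__":
    print(find_cardinality_violations(valid))
    print(find_cardinality_violations(invalid))
```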
