Agentic AI Atlas

III.

Node kind ledger

TestSet

TestSet records

Browse all TestSet records in the current atlas snapshot.

Cluster: benchmarks · Total: 31 · Visible: 31

id	displayName	cluster
test-set:agentbench-environments	AgentBench multi-environment suite	benchmarks
test-set:agentclinic-clinical-cases	AgentClinic clinical case suite	benchmarks
test-set:androidworld-programmatic-tasks	AndroidWorld programmatic task suite	benchmarks
test-set:appworld-750-cross-app-tasks	AppWorld 750 cross-app tasks	benchmarks
test-set:assistantbench-214-web-tasks	AssistantBench 214 web tasks	benchmarks
test-set:bfcl-v3	Berkeley Function Calling Leaderboard v3	benchmarks
test-set:bigcode-evalplus	BigCode EvalPlus	benchmarks
test-set:bigcodebench-full	BigCodeBench full set	benchmarks
test-set:ds1000-full	DS-1000 full set	benchmarks
test-set:flores-200-devtest	FLORES-200 devtest	benchmarks
test-set:gaia-validation	GAIA validation split	benchmarks
test-set:gpqa-diamond	GPQA Diamond	benchmarks
test-set:gpqa-diamond-2024	GPQA Diamond — 2024 release	benchmarks
test-set:gsm8k-test	GSM8K test split	benchmarks
test-set:hellaswag-validation	HellaSwag validation	benchmarks
test-set:humaneval-original	HumanEval original problem set	benchmarks
test-set:livecodebench-2024-12	LiveCodeBench 2024-12 cut	benchmarks
test-set:math-test	MATH test split	benchmarks
test-set:mbpp-full	MBPP full problem set	benchmarks
test-set:mle-bench-competitions	MLE-bench Kaggle competition set	benchmarks
test-set:osworld-369-computer-tasks	OSWorld 369 real-computer tasks	benchmarks
test-set:re-bench-ai-rd-tasks	RE-Bench AI R&D task suite	benchmarks
test-set:swe-bench-original	SWE-bench original test set	benchmarks
test-set:swe-bench-verified-2024-12	SWE-bench Verified 2024-12	benchmarks
test-set:tau-bench-airline-retail	tau-bench airline and retail domains	benchmarks
test-set:terminal-bench-v1	Terminal-Bench v1	benchmarks
test-set:the-agent-company-workplace-tasks	TheAgentCompany workplace task suite	benchmarks
test-set:toolbench-tooleval	ToolBench ToolEval suite	benchmarks
test-set:travelplanner-planning-intents	TravelPlanner planning intents	benchmarks
test-set:truthful-qa-mc	TruthfulQA — multiple-choice	benchmarks
test-set:webarena-v1	WebArena v1 task suite	benchmarks