Agentic AI Atlas

II. Benchmark overview

benchmark:bigcode-evalplus

Reference · live

EvalPlus overview

EvalPlus extends HumanEval and MBPP with many more high-quality tests per task (roughly 80x for HumanEval+ and 35x for MBPP+) to expose LLM-generated code that passes the original tests but is not actually correct, yielding the HumanEval+ and MBPP+ leaderboards.
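To make the idea concrete, here is a minimal sketch of why a larger test set changes the verdict on a generated solution. It does not use the EvalPlus package or its data; `passes`, `candidate_median`, `BASE_TESTS`, and `PLUS_TESTS` are hypothetical names invented for illustration, and the task is a toy one.

```python
from typing import Callable, Iterable, Tuple

# A test case is (positional arguments, expected result). Hypothetical shape,
# not the EvalPlus data format.
Case = Tuple[tuple, object]


def passes(candidate: Callable, cases: Iterable[Case]) -> bool:
    """Return True only if the candidate matches the expected result on every case."""
    for args, expected in cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failure, as in functional-correctness scoring.
            return False
    return True


def candidate_median(xs):
    # Subtly wrong "generated" solution: it ignores the even-length case.
    s = sorted(xs)
    return s[len(s) // 2]


# A small base suite that only covers odd-length inputs.
BASE_TESTS = [(([3, 1, 2],), 2), (([5],), 5)]

# An extended suite that adds an even-length input.
PLUS_TESTS = BASE_TESTS + [(([1, 2, 3, 4],), 2.5)]

base_ok = passes(candidate_median, BASE_TESTS)
plus_ok = passes(candidate_median, PLUS_TESTS)
print(f"base suite: {base_ok}, extended suite: {plus_ok}")
```

The same candidate is accepted by the base suite and rejected by the extended one, which is the gap the "+" variants are meant to surface.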

Benchmark · Outgoing: 0 · Incoming: 8

Attributes

displayName

EvalPlus

homepageUrl

https://evalplus.github.io/

kind

code-functional-correctness

targetsKind

ModelVersion

description

EvalPlus extends HumanEval and MBPP with many more high-quality tests per task (roughly 80x for HumanEval+ and 35x for MBPP+) to expose LLM-generated code that passes the original tests but is not actually correct, yielding the HumanEval+ and MBPP+ leaderboards.

Outgoing edges

None.

Incoming edges

belongs_to_benchmark · 1

test-set:bigcode-evalplus · TestSet · BigCode EvalPlus

bounds_subject · 1

scope-boundary:bigcode-evalplus.scope · ScopeBoundary

for_benchmark · 3

eval-run:human-eval-plus.claude-sonnet-4-5.2025-09 · EvalRun
eval-run:human-eval-plus.gpt-5.2025-08 · EvalRun
eval-run:evalplus.gpt-5.2025-08 · EvalRun

scored_against · 3

eval-result:human-eval-plus.claude-sonnet-4-5.001 · EvalResult
eval-result:human-eval-plus.gpt-5.001 · EvalResult
eval-result:evalplus.gpt-5.001 · EvalResult
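
For readers who want to work with this entry programmatically, the incoming edges listed above can be modeled as (relation, source id, source type) triples. This is only a sketch: the `Edge` type and its field names are assumptions made for illustration and do not describe the Atlas's actual schema or API. The identifiers are copied from the list above.

```python
from collections import Counter
from typing import NamedTuple


class Edge(NamedTuple):
    # Hypothetical record shape; field names are illustrative only.
    relation: str
    source_id: str
    source_type: str


TARGET = "benchmark:bigcode-evalplus"

INCOMING = [
    Edge("belongs_to_benchmark", "test-set:bigcode-evalplus", "TestSet"),
    Edge("bounds_subject", "scope-boundary:bigcode-evalplus.scope", "ScopeBoundary"),
    Edge("for_benchmark", "eval-run:human-eval-plus.claude-sonnet-4-5.2025-09", "EvalRun"),
    Edge("for_benchmark", "eval-run:human-eval-plus.gpt-5.2025-08", "EvalRun"),
    Edge("for_benchmark", "eval-run:evalplus.gpt-5.2025-08", "EvalRun"),
    Edge("scored_against", "eval-result:human-eval-plus.claude-sonnet-4-5.001", "EvalResult"),
    Edge("scored_against", "eval-result:human-eval-plus.gpt-5.001", "EvalResult"),
    Edge("scored_against", "eval-result:evalplus.gpt-5.001", "EvalResult"),
]

# Tally edges per relation and check the total against the page header.
counts = Counter(edge.relation for edge in INCOMING)
total_incoming = sum(counts.values())

for relation, n in counts.items():
    print(f"{relation}: {n}")
print(f"total incoming edges for {TARGET}: {total_incoming}")
```

The per-relation tallies (1, 1, 3, 3) sum to the 8 incoming edges reported in the entry's header line.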