II. Benchmark overview
Reference · livebenchmark:bigcode-evalplus
EvalPlus overview
EvalPlus extends HumanEval and MBPP with many more high-quality tests per task (about 80x for HumanEval+ and 35x for MBPP+) to expose LLM-generated code that passes the original tests but is functionally incorrect, yielding HumanEval+ and MBPP+ leaderboards.
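The idea of augmenting a task's tests can be illustrated with a minimal sketch. This is not EvalPlus's own harness or API: the task, the `passes` helper, the `candidate_max` solution, and both test lists are hypothetical, chosen only to show how a solution can pass a small base test set and still fail once extra edge-case tests are added.

```python
from typing import Callable, List, Tuple

# A test case is (positional args, expected result). Hypothetical format,
# not the on-disk format used by EvalPlus.
TestCase = Tuple[tuple, object]


def passes(solution: Callable, tests: List[TestCase]) -> bool:
    """Return True only if the solution matches the expected value on every test."""
    for args, expected in tests:
        try:
            if solution(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failure, as in most functional-correctness checks.
            return False
    return True


# Hypothetical task: return the largest element of a non-empty list.
def candidate_max(xs):
    # Subtle bug: the running maximum starts at 0, so all-negative
    # inputs produce 0 instead of the true maximum.
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


# A small "base" test set that happens to miss the bug.
base_tests: List[TestCase] = [(([1, 2, 3],), 3), (([5, 4],), 5)]

# An augmented "plus" test set: the base tests and extra edge cases.
plus_tests: List[TestCase] = base_tests + [
    (([-3, -1, -2],), -1),
    (([0],), 0),
    (([-7],), -7),
]

base_ok = passes(candidate_max, base_tests)
plus_ok = passes(candidate_max, plus_tests)
print(base_ok, plus_ok)
```

In this sketch the candidate is accepted by the base tests and rejected by the augmented ones, which is the kind of gap a larger per-task test set is meant to surface.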
Attributes
displayName
EvalPlus
homepageUrl
kind
code-functional-correctness
targetsKind
ModelVersion
description
EvalPlus extends HumanEval and MBPP with many more high-quality
tests per task (about 80x for HumanEval+ and 35x for MBPP+) to expose
LLM-generated code that passes the original tests but is functionally
incorrect, yielding HumanEval+ and MBPP+ leaderboards.
Outgoing edges
None.
Incoming edges
belongs_to_benchmark (1)
- test-set:bigcode-evalplus · TestSet · BigCode EvalPlus
bounds_subject (1)
- scope-boundary:bigcode-evalplus.scope · ScopeBoundary
for_benchmark (3)
scored_against (3)
- eval-result:human-eval-plus.claude-sonnet-4-5.001 · EvalResult
- eval-result:human-eval-plus.gpt-5.001 · EvalResult
- eval-result:evalplus.gpt-5.001 · EvalResult