II. Benchmark overview
Reference · livebenchmark:gpqa
GPQA overview
GPQA (Graduate-Level Google-Proof Q&A), introduced by Rein et al. (2023), is a 448-question multiple-choice benchmark in biology, chemistry, and physics, written and validated by domain-expert PhDs. It is designed to be "Google-proof": non-experts with web access score about 34%, while in-domain PhDs score about 65%. The Diamond subset (198 questions) is the hardest tier, and its score is the figure vendors usually report in announcements.
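Because the question counts above are small, a reported accuracy carries visible sampling noise. The sketch below estimates that noise with the ordinary binomial standard error, sqrt(p(1-p)/n). It assumes each question is an independent trial, which is a simplification; the function name and constants are illustrative and not part of GPQA or any evaluation harness.

```python
import math


def binomial_standard_error(accuracy: float, n_questions: int) -> float:
    """Standard error of an accuracy estimate over n independent questions."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


# Question counts taken from the description above.
GPQA_MAIN_SIZE = 448
GPQA_DIAMOND_SIZE = 198

if __name__ == "__main__":
    # 0.65 is the in-domain PhD score quoted in the description.
    for label, n in [("main", GPQA_MAIN_SIZE), ("diamond", GPQA_DIAMOND_SIZE)]:
        se = binomial_standard_error(0.65, n)
        print(f"{label}: n={n}, SE at 65% accuracy = {se * 100:.1f} points")
```

Under these assumptions, a 65% score has a standard error of roughly 2.3 points on the full set and roughly 3.4 points on Diamond, so differences of a point or two between Diamond results may not be meaningful on their own.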
Attributes
displayName: GPQA
homepageUrl:
kind: model-only
targetsKind: ModelVersion
Outgoing edges
targets (2)
- model:claude-opus-4-7@current·ModelVersion·Claude 4.7 Opus
- model:claude-opus-4-6@current·ModelVersion·Claude 4.6 Opus
uses_test_set (1)
- test-set:gpqa-diamond·TestSet·GPQA Diamond
Incoming edges
for_benchmark (12)
- eval-run:gpqa.claude-haiku-4-5.2025-10·EvalRun
- eval-run:gpqa-diamond.claude-opus-4-5.2025-09·EvalRun
- eval-run:gpqa.deepseek-r1.2025-01·EvalRun
- eval-run:gpqa.gemini-2-5-pro.2025-06·EvalRun
- eval-run:gpqa-diamond.gemini-2-5-pro.2025-06·EvalRun
- eval-run:gpqa-diamond.gemini-3-1-pro.2026-02-19·EvalRun
- eval-run:gpqa-diamond.gemini-3-pro.2025-11-18·EvalRun
- eval-run:gpqa.gpt-5.2025-08·EvalRun
- eval-run:gpqa-diamond.gpt-5.2025-08·EvalRun
- eval-run:gpqa-diamond.gpt-5-4.2026-03-17·EvalRun
- eval-run:gpqa-diamond.gpt-5-4-mini.2026-03-17·EvalRun
- eval-run:gpqa.claude-sonnet-4-5.2025-09·EvalRun
scored_against (8)
- eval-result:gpqa-diamond.claude-opus-4-5.001·EvalResult
- eval-result:gpqa.deepseek-r1.001·EvalResult
- eval-result:gpqa-diamond.gemini-2-5-pro.001·EvalResult
- eval-result:gpqa-diamond.gemini-3-1-pro.2026-02-19.accuracy·EvalResult
- eval-result:gpqa-diamond.gemini-3-pro.2025-11-18.accuracy·EvalResult
- eval-result:gpqa-diamond.gpt-5.001·EvalResult
- eval-result:gpqa-diamond.gpt-5-4.2026-03-17.accuracy·EvalResult
- eval-result:gpqa-diamond.gpt-5-4-mini.2026-03-17.accuracy·EvalResult
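The eval-run and eval-result identifiers listed above appear to follow a `kind:benchmark.model.suffix·Type` pattern, where the suffix is a date, a run number, or a date plus a metric name. The following is a minimal sketch of splitting one edge line into those parts. The pattern is inferred from the entries on this page rather than from a documented schema, and `NodeRef` and `parse_node_ref` are hypothetical names introduced for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeRef:
    kind: str       # e.g. "eval-run" or "eval-result"
    benchmark: str  # first dot-separated field, e.g. "gpqa-diamond"
    model: str      # second field, e.g. "gemini-3-1-pro" ("" if absent)
    suffix: str     # remainder: date, run number, or metric ("" if absent)
    node_type: str  # label after the middle dot, e.g. "EvalRun"


def parse_node_ref(line: str) -> NodeRef:
    """Parse one eval-run / eval-result edge line (inferred format)."""
    text = line.strip().lstrip("-").strip()  # drop the list bullet
    ident, _, node_type = text.partition("·")
    kind, sep, rest = ident.partition(":")
    if not sep:
        raise ValueError(f"no ':' in identifier: {line!r}")
    # Model slugs on this page use hyphens, so at most two splits are needed.
    parts = rest.split(".", 2)
    parts += [""] * (3 - len(parts))
    return NodeRef(kind, parts[0], parts[1], parts[2], node_type.strip())


if __name__ == "__main__":
    lines = [
        "- eval-run:gpqa.deepseek-r1.2025-01·EvalRun",
        "- eval-run:gpqa-diamond.gpt-5.2025-08·EvalRun",
        "- eval-result:gpqa-diamond.gpt-5-4.2026-03-17.accuracy·EvalResult",
    ]
    refs = [parse_node_ref(line) for line in lines]
    # Count how many entries target the full set versus the Diamond subset.
    print(Counter(ref.benchmark for ref in refs))
```

Grouping by the `benchmark` field is one way to separate runs on the full set (`gpqa`) from runs on the Diamond subset (`gpqa-diamond`), which the lists above interleave.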