Why AI Benchmark Scores Are Hiding What Models Can Really Do
Frontier AI models are being evaluated under conditions so restrictive that their benchmark scores may tell us almost nothing about what they can actually accomplish. A new study of 12 leading language models found that, when given more computing resources during inference or the ability to try multiple solution approaches, these models reach substantially higher performance across mathematics, software engineering, medicine, and cybersecurity tasks. Yet most published evaluations report a single fixed score, which is comparable to testing a human expert under severe time pressure.
Why Are Benchmark Scores So Protocol-Dependent?
The core problem is that modern AI benchmarks have shifted toward harder, longer-horizon tasks that benefit from extended reasoning, tool use, and iterative problem-solving. Performance on these tasks increasingly depends on how much inference-time compute (the computing resources available while the model produces its response) an evaluation allows. Yet many evaluations still use modest token budgets, give models only one chance to submit a solution, and report results under a single protocol.
Researchers systematically tested this by evaluating frontier language models released between May 2025 and March 2026 on five challenging benchmarks spanning software engineering, mathematics, and medicine. They applied a consistent inference-scaling protocol with expanded token budgets, context compaction, and unlimited submission attempts with or without correctness feedback. The findings revealed three critical insights:
- Benchmark Responsiveness Varies Widely: Some benchmarks continue to improve well beyond typical published token budgets, including FrontierMath, TerminalBench, and Humanity's Last Exam, while others show weaker marginal gains under expanded compute.
- Newer Models Unlock Hidden Capabilities: Newer model generations usually achieve higher performance at large budgets, where they solve harder tasks more reliably. This suggests that low-budget evaluations may fail to track progress in how well models convert additional inference-time compute into performance.
- Intervention Methods Matter Differently by Task: Repeated submission attempts materially improve performance on all benchmarks, but the value of larger token budgets, external feedback, and parallel attempts varies significantly by benchmark type.
What Happens When Models Get More Time to Think?
The research demonstrates that different tasks respond to distinct ways of allocating inference-time compute. Larger token budgets substantially improve performance on benchmarks across multiple domains, including cybersecurity, FrontierMath, Humanity's Last Exam, and TerminalBench. However, the gains are not universal. Some benchmarks benefit more from allowing models to make repeated submission attempts, while others respond better to parallel scaling, where compute is spread across multiple solution attempts rather than concentrated in a single deep reasoning trajectory.
The distinction between serial and parallel scaling proved particularly important. Parallel scaling, where a model generates multiple independent attempts, is strongest on stateless benchmarks like HealthBench and Humanity's Last Exam, which do not involve a persistent interactive environment. Serial scaling, where a model refines a single solution iteratively, performs better on stateful benchmarks that require maintaining context across multiple steps.
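To make the parallel case concrete, here is a minimal sketch of how the benefit of multiple independent attempts is often quantified. It uses the widely used pass@k estimator; the article does not say the study reports this metric, so treat the choice of estimator and the function name `pass_at_k` as assumptions made for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts is correct.

    n: independent attempts actually sampled for a task
    c: how many of those n attempts were correct
    k: size of the parallel budget being asked about (k <= n)

    This is the standard combinatorial pass@k estimator. It is an
    illustrative assumption here, not a metric confirmed by the study.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect attempts exist, so any draw of k
    # attempts must contain at least one correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn attempts are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note that pass@k counts a task as solved if any attempt is correct, which presumes some way of recognizing the correct attempt. That makes it closest to the setting with correctness feedback; without feedback, an aggregation rule such as majority voting would be needed instead.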
How Should AI Labs Redesign Evaluations?
The researchers argue that frontier capability cannot be fully characterized by a single benchmark score measured under a single inference-time protocol. Instead, they recommend that evaluations report capability as a function of inference-time compute rather than as a single fixed-budget number. Protocol choices should be treated as part of the evaluation design and reported explicitly. When comparing capability across model generations, especially in safety-critical or policy-relevant contexts, researchers should control for the compute range and protocol used.
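As a rough sketch of what reporting capability as a function of compute could look like in practice, the snippet below groups per-task results by token budget and returns one accuracy value per budget instead of a single number. The record layout, the function name `accuracy_by_budget`, and the sample values are all hypothetical; they are not taken from the study.

```python
from collections import defaultdict


def accuracy_by_budget(records):
    """Turn per-task results into a budget-to-accuracy curve.

    records: iterable of (token_budget, solved) pairs, one per task run.
    Returns a list of (token_budget, accuracy) sorted by budget.
    The record layout is a hypothetical one chosen for illustration.
    """
    totals = defaultdict(lambda: [0, 0])  # budget -> [solved, attempted]
    for budget, solved in records:
        totals[budget][0] += int(bool(solved))
        totals[budget][1] += 1
    return [(b, s / n) for b, (s, n) in sorted(totals.items())]


# Made-up results for two budgets, purely to show the output shape.
sample = [(1000, False), (1000, True), (8000, True), (8000, True)]
curve = accuracy_by_budget(sample)
```

Publishing the whole curve, along with the protocol that produced it, lets readers compare models at matched budgets rather than at whichever budget each evaluator happened to choose.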
This shift has immediate implications for how AI labs develop and deploy models. If a model's true capabilities only emerge at higher compute budgets, then evaluations conducted at restrictive budgets may systematically understate progress in frontier AI. Fixed-budget scores give an incomplete picture of the performance frontier reachable under broader inference-time budgets, and this omission may grow as models advance.
The findings also highlight a broader challenge: as benchmarks saturate and AI systems tackle increasingly complex tasks, the relationship between evaluation protocol and observed performance becomes more pronounced. A model that appears to plateau under one set of constraints may continue improving under another. This means that comparing models across different evaluation protocols, or comparing a new model to older published results, requires careful attention to the specific conditions under which each was tested.
For developers and organizations deploying frontier AI systems, the research suggests that real-world performance may exceed what published benchmarks indicate, particularly for tasks that benefit from extended reasoning or iterative refinement. However, it also underscores the importance of testing models under conditions that match actual deployment scenarios, rather than relying solely on published benchmark scores obtained under potentially restrictive protocols.