OpenAI's Noam Brown Says AI Benchmarks Are Broken. Here's Why That Matters.
OpenAI researcher Noam Brown has published a detailed critique of how the AI industry evaluates and ranks its most powerful models, arguing that current benchmark leaderboards are fundamentally flawed because they ignore a critical variable: how much compute each model spent to produce its results. The same model produces vastly different scores when given one dollar of thinking versus ten thousand dollars, yet leaderboards do not disclose these costs.
Why Are Current AI Rankings Misleading?
Brown's core argument is straightforward: comparing AI models on standard benchmarks without accounting for inference budget is like comparing two students' test scores without mentioning that one had 30 minutes and the other had three hours. The comparison tells you almost nothing meaningful.
Consider the practical example Brown provides. On the MMLU benchmark, a widely used knowledge test, all state-of-the-art models now cluster above 88 percent accuracy, with score differences so small they fall within statistical noise. What you're seeing isn't "who is smarter," Brown argues, but measurement error.
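To make "within statistical noise" concrete, here is a minimal sketch of the binomial standard error of an accuracy score. The question count of 1,000 and the second model's scores are hypothetical values chosen for illustration, not figures from Brown's critique, and treating the two runs as independent samples is a simplifying assumption (real comparisons on a shared question set would use a paired test).

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def gap_is_distinguishable(acc_a: float, acc_b: float,
                           n_questions: int, z: float = 1.96) -> bool:
    """True if the score gap exceeds z standard errors of the difference.

    Assumes the two scores come from independent samples of equal size.
    """
    se_a = accuracy_standard_error(acc_a, n_questions)
    se_b = accuracy_standard_error(acc_b, n_questions)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(acc_a - acc_b) > z * se_diff


# Hypothetical benchmark size, for illustration only.
N_QUESTIONS = 1_000

# A one-point gap near 88 percent is not separable at this sample size.
print(gap_is_distinguishable(0.88, 0.89, N_QUESTIONS))  # False
# A seven-point gap is.
print(gap_is_distinguishable(0.88, 0.95, N_QUESTIONS))  # True
```

Sampling error is only one source of noise; prompt wording and decoding settings can add more, which this sketch does not model.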
The problem becomes starker on longer, more complex tasks. On the MRCR v2 test, which requires models to process 1 million tokens of text, GPT-5.4 scored 36.6 percent while GPT-5.5 scored 74.0 percent, more than double. Yet this metric doesn't appear in standard benchmark tables.
The most extreme example involves OpenAI's o3 model on the ARC-AGI benchmark. The model achieved the highest score, but at a reasoning cost of thirty thousand dollars per question. Another team, using a small model with 4 billion parameters, achieved 24 percent accuracy at just twenty cents per question. When one entry costs 150,000 times as much per question as another, Brown argues, the question of "who ranks higher" becomes meaningless.
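The 150,000-fold figure follows directly from the two per-question costs quoted above. A quick check, using exact fractions to avoid floating-point rounding (the variable names are illustrative):

```python
from fractions import Fraction

# Per-question costs in dollars, as quoted in the article.
O3_COST_PER_QUESTION = Fraction(30_000)
SMALL_MODEL_COST_PER_QUESTION = Fraction(20, 100)  # twenty cents

cost_ratio = O3_COST_PER_QUESTION / SMALL_MODEL_COST_PER_QUESTION
print(cost_ratio)  # 150000
```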
How Should AI Models Be Evaluated Fairly?
Brown proposes a fundamental shift in how the industry measures AI capabilities. Instead of reporting a single benchmark score, labs should publish performance curves that show how a model's abilities improve as you give it more computational resources to think.
The x-axis of such a curve could represent the number of tokens processed, the dollars spent, or the time elapsed. The y-axis would show the model's performance on a specific task. This approach mirrors how human testing works: both the SAT and the International Mathematical Olympiad impose fixed time limits. Only AI evaluations, Brown notes, continue to ignore how much "thinking budget" is allocated.
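A performance curve of this kind can be sketched in a few lines. This is a minimal illustration, not a method taken from Brown's critique: `performance_curve`, `toy_solver`, and the toy task difficulties are all hypothetical, and the solver is a stand-in that simply succeeds once the token budget reaches a task's difficulty.

```python
from typing import Callable, List, Sequence, Tuple


def performance_curve(
    solve: Callable[[int, int], bool],
    tasks: Sequence[int],
    budgets: Sequence[int],
) -> List[Tuple[int, float]]:
    """Return (budget, accuracy) pairs: one evaluation pass per budget level."""
    curve = []
    for budget in budgets:
        solved = sum(1 for task in tasks if solve(task, budget))
        curve.append((budget, solved / len(tasks)))
    return curve


def toy_solver(task_difficulty: int, budget_tokens: int) -> bool:
    """Hypothetical stand-in for a model: solves a task once the budget covers it."""
    return budget_tokens >= task_difficulty


# Hypothetical tasks, each described only by the token budget needed to solve it.
toy_tasks = [100, 500, 2_000, 10_000]
toy_budgets = [100, 1_000, 10_000]

curve = performance_curve(toy_solver, toy_tasks, toy_budgets)
print(curve)  # [(100, 0.25), (1000, 0.5), (10000, 1.0)]
```

Reporting the whole list rather than any single pair is the point: the same model scores 25 percent or 100 percent depending on which budget is picked.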
Brown offers three specific recommendations for the AI industry:
- Publish Performance Curves: When releasing a new model, labs should clearly indicate the inference budget corresponding to each score, showing how performance scales with computational resources.
- Set Clear Budget Caps: Benchmark rankings should either track inference usage or establish a fixed budget limit, similar to how ARC-AGI already operates.
- Incorporate Inference Budgets into Security Assessments: Security evaluations should explicitly account for how much inference compute an attacker can afford, since nation-state attackers could allocate up to ten million dollars in reasoning budget for a single task.
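The first two recommendations amount to storing the budget next to every score and ranking only entries that fit under a cap. The sketch below shows one way that could look; the `Result` record, the `rank_under_cap` function, the model names, and all numbers are hypothetical and are not drawn from ARC-AGI or any real leaderboard.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Result:
    model: str
    score: float          # fraction of tasks solved
    cost_per_task: float  # dollars of inference spent per task


def rank_under_cap(results: Sequence[Result], budget_cap: float) -> List[Result]:
    """Rank only the entries whose per-task cost fits within the budget cap."""
    eligible = [r for r in results if r.cost_per_task <= budget_cap]
    return sorted(eligible, key=lambda r: r.score, reverse=True)


# Hypothetical leaderboard entries, for illustration only.
entries = [
    Result("model-a", score=0.80, cost_per_task=50.00),
    Result("model-b", score=0.60, cost_per_task=0.50),
    Result("model-c", score=0.70, cost_per_task=5.00),
]

ranking = rank_under_cap(entries, budget_cap=10.00)
print([r.model for r in ranking])  # ['model-c', 'model-b']
```

Under a ten-dollar cap, the highest raw score drops out of the ranking entirely, which is the behavior a fixed budget limit is meant to produce.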
Brown illustrated his point with a concrete example comparing GPT-5.4 and GPT-5.5. Viewed through a traditional benchmark lens, GPT-5.5 appears only slightly better than GPT-5.4. But when performance is plotted against token count, the GPT-5.5 curve sits far above the GPT-5.4 curve, particularly on cybersecurity assessments.
The pricing difference between these models underscores why this matters. GPT-5.4 Pro costs thirty dollars per million input tokens and one hundred eighty dollars per million output tokens, while GPT-5.5 costs five dollars and thirty dollars respectively. That's a sixfold price difference, yet benchmarks compare them as if they operate on the same scale.
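The sixfold gap can be checked against the quoted prices. In the sketch below, the per-million-token prices come from the article; the request size of 10,000 input tokens and 2,000 output tokens is a hypothetical example, and `request_cost` is an illustrative helper rather than any vendor's API.

```python
from fractions import Fraction

# (input, output) prices in dollars per million tokens, as quoted in the article.
PRICES_PER_MILLION = {
    "GPT-5.4 Pro": (30, 180),
    "GPT-5.5": (5, 30),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> Fraction:
    """Exact dollar cost of one request at the quoted per-million-token prices."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return Fraction(input_tokens * input_price + output_tokens * output_price,
                    1_000_000)


# Hypothetical request size, for illustration only.
pro_cost = request_cost("GPT-5.4 Pro", input_tokens=10_000, output_tokens=2_000)
base_cost = request_cost("GPT-5.5", input_tokens=10_000, output_tokens=2_000)

print(float(pro_cost), float(base_cost), pro_cost / base_cost)  # 0.66 0.11 6
```

Because both the input and output prices differ by the same factor, the ratio is six regardless of the input-to-output mix.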
What Changed in AI Development to Make This Problem Urgent?
Two years ago, inference-time computation was primarily associated with OpenAI's o1 model, which introduced the concept of "trading reasoning time for accuracy" to the public. By 2026, this capability has become standard across all state-of-the-art models.
GPT-5.5 Pro, for instance, is not an entirely new model. It uses the same foundation as GPT-5.5 but adds parallel inference computation: when faced with difficult problems, it runs multiple reasoning chains and synthesizes the results. Claude has extended thinking, Gemini has Deep Think, and nearly every leading AI lab is moving in the same direction.
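The article does not say how GPT-5.5 Pro combines its reasoning chains. As a rough illustration of the general idea, the sketch below uses majority voting over several sampled answers, a simple and well-known stand-in technique. The function names and the toy sampler are hypothetical, and the chains run one after another here rather than in parallel.

```python
from collections import Counter
from typing import Callable


def majority_vote_over_chains(
    sample_answer: Callable[[str, int], str],
    question: str,
    n_chains: int,
) -> str:
    """Sample n_chains independent answers and return the most common one."""
    answers = [sample_answer(question, seed) for seed in range(n_chains)]
    return Counter(answers).most_common(1)[0][0]


def toy_sampler(question: str, seed: int) -> str:
    """Hypothetical stand-in for one reasoning chain: wrong on every third seed."""
    return "41" if seed % 3 == 0 else "42"


# A single chain happens to land on the wrong answer...
print(majority_vote_over_chains(toy_sampler, "What is 6 * 7?", n_chains=1))  # 41
# ...while five chains outvote it, at five times the inference cost.
print(majority_vote_over_chains(toy_sampler, "What is 6 * 7?", n_chains=5))  # 42
```

The accuracy gain is bought entirely with extra inference compute, which is why the number of chains belongs next to any score produced this way.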
Brown cited research from Karpathy and the AI Safety Institute showing that stronger models yield higher returns over longer thinking horizons. A weak model might plateau after thinking for two additional minutes, but a strong model can continue improving even after thinking for two hours. Each time a new model is released, if you only run benchmarks under a fixed inference budget, you're only seeing the tip of the iceberg. The true upper limit of its capabilities lies in the waters you can't afford to test.
"We may have no idea where the capability ceiling of modern LLMs lies because the cost of measurement is too high," Brown stated.
This creates a practical problem for AI labs. Agent deployment cycles are now outpacing the development cycles of new models. Before you've finished evaluating the long-term behavior of one generation, the next one has already been released.
Brown's critique reflects a broader tension in AI development: as models become more capable and more expensive to run at full capacity, the industry's evaluation methods have failed to keep pace. The benchmarks that once provided clear signals about model progress now obscure more than they reveal, making it difficult for researchers, companies, and the public to understand what these systems can actually do.