Why AI Labs Are Now Testing Models With Unlimited Compute, And Why Your Benchmarks Are Broken
AI models perform significantly better when given more time to reason at test time, but most safety evaluations ignore this entirely. This gap between how we test AI systems and how they actually perform in the real world is forcing researchers and AI labs to rethink their entire approach to benchmarking and safety assessment.
What Is Test-Time Compute and Why Does It Matter?
Test-time compute refers to the computational resources an AI model uses during inference, or the moment when it's actually answering your question. Unlike training compute, which happens once when building the model, test-time compute can be adjusted on a per-query basis. A model given more time to think through a problem step-by-step will often produce better answers than one forced to respond instantly.
The problem is that most AI benchmarks don't measure this. When researchers publish benchmark scores, they typically don't disclose how much compute was used to achieve those results. This creates a misleading picture of a model's true capabilities. A model that scores 85% on a benchmark using unlimited reasoning time looks identical to one that scores 85% using minimal compute, even though they're fundamentally different systems.
"Benchmark performance increasingly often scales with compute allocations, and improved models are often about 'gets to a high level faster,' so any score requires the context of how much compute was required," noted Noam Brown, researcher at OpenAI.
This issue became impossible to ignore when models like Gemini 3 DeepThink showed dramatic benchmark improvements without any accompanying explanation of the safety implications. The deeper issue, experts argue, is that proper safety evaluation means testing a model with "all of it": as much compute as it can usefully absorb, increased until more compute stops helping much, and paired with the best available reasoning scaffold.
How Are AI Labs Actually Using Test-Time Compute Now?
Researchers are already building systems that leverage test-time adaptation in sophisticated ways. One recent framework, MODF-SIR (Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning), demonstrates how test-time compute can be integrated throughout an entire reasoning pipeline.
The system uses what researchers call Test-Time Adaptation (TTA), which allows models to adjust their parameters during inference based on the specific input they're processing. Rather than using fixed weights, the model fine-tunes itself for each individual query using a technique called Low-Rank Adaptation (LoRA). After answering the question, these temporary adjustments are discarded, leaving the base model unchanged. This approach allows the model to reason more effectively on complex tasks like understanding human social dynamics and intentions from multimodal data.
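The adapt-then-discard pattern described above can be illustrated with a toy example. This is a minimal sketch, not the MODF-SIR implementation: the base "model" is a single weight matrix, the adapter is rank 1, and the per-query training signal is a hypothetical target vector, since the objective the framework optimizes at test time is not specified here. The function names, initial values, and learning rate are all illustrative assumptions.

```python
import copy

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def forward(W, A, B, x):
    """Compute y = W x + B (A x): frozen base weights plus a low-rank correction."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + l for b, l in zip(base, low_rank)]

def answer_with_tta(W, x, target, steps=100, lr=0.1):
    """Fit temporary rank-1 LoRA factors to one query, answer, then drop them."""
    d_out, d_in = len(W), len(W[0])
    A = [[0.5] * d_in]                  # r x d_in with r = 1, non-zero init (assumed values)
    B = [[0.0] for _ in range(d_out)]   # d_out x r, zero init so the correction starts at 0
    for _ in range(steps):
        h = matvec(A, x)                # A x, a vector of length r
        err = [y - t for y, t in zip(forward(W, A, B, x), target)]
        # Gradients of 0.5 * ||y - target||^2 with respect to B and A.
        grad_B = [[e * hj for hj in h] for e in err]
        bt_err = [sum(B[i][j] * err[i] for i in range(d_out)) for j in range(len(A))]
        grad_A = [[g * xi for xi in x] for g in bt_err]
        B = [[b - lr * g for b, g in zip(rb, rg)] for rb, rg in zip(B, grad_B)]
        A = [[a - lr * g for a, g in zip(ra, rg)] for ra, rg in zip(A, grad_A)]
    # A and B are local and vanish on return; W was only ever read, never written.
    return forward(W, A, B, x)

def squared_error(y, target):
    return sum((a - b) ** 2 for a, b in zip(y, target))

W = [[1.0, 0.0], [0.0, 1.0]]            # hypothetical frozen base weights
W_snapshot = copy.deepcopy(W)
x, target = [1.0, 0.5], [1.5, 0.0]      # hypothetical query and per-query signal
adapted_answer = answer_with_tta(W, x, target)
```

Only the small factors A and B are trained, so the per-query cost is a handful of gradient steps on a few parameters, and comparing `W` with `W_snapshot` afterwards confirms the base weights were left unchanged.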
The framework also incorporates iterative self-correction. When an external teacher model evaluates the reasoning output and finds it suboptimal, the system automatically updates its parameters and re-reasons until the quality meets the teacher's standards. This happens entirely at test time, meaning the computational cost scales with task difficulty rather than being fixed upfront.
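A rough sketch of that control loop shows why the cost tracks task difficulty. Everything here is hypothetical: `revise` stands in for "update parameters and re-reason", `teacher_score` stands in for the external teacher model, and the threshold and round cap are assumed values rather than anything the framework documents.

```python
def refine_until_accepted(draft, revise, teacher_score, threshold=0.9, max_rounds=8):
    """Revise a draft until the teacher's score clears the threshold or rounds run out.

    Returns (final_draft, rounds_used) so the compute spent can be reported.
    """
    rounds = 0
    while teacher_score(draft) < threshold and rounds < max_rounds:
        draft = revise(draft)
        rounds += 1
    return draft, rounds

# Toy stand-ins: a draft is an integer quality level from 0 to 10.
toy_teacher = lambda quality: quality / 10      # hypothetical teacher score in [0, 1]
toy_revise = lambda quality: quality + 2        # each revision improves quality by 2

easy = refine_until_accepted(8, toy_revise, toy_teacher)   # (10, 1)
hard = refine_until_accepted(2, toy_revise, toy_teacher)   # (10, 4)
```

The easy query stops after one revision and the hard one after four, so the number of rounds, and with it the inference budget, is an output of the run rather than a fixed setting. Returning `rounds_used` alongside the answer is one way to make that spend visible.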
Steps to Account for Test-Time Compute in AI Evaluation
- Publish Inference Budgets: AI labs should report the tokens, cost, or time required to achieve any benchmark result. At minimum, newly released models should include inference budget information alongside performance scores.
- Track Leaderboard Compute: Benchmarks should explicitly display inference usage on leaderboards or establish a fixed token and cost budget. Many benchmarks have already shifted in this direction, but it is not yet standard practice across the industry.
- Update Safety Frameworks: Preparedness frameworks and responsible scaling policies must explicitly account for inference compute when determining whether a model crosses a safety threshold. Evaluations should estimate capabilities at multiple inference budgets, including projections from smaller-budget runs with stated uncertainty.
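The three steps above could be supported by a very small data model. The sketch below is one possible shape, not an existing leaderboard API: the `BudgetedScore` record, the token-based budget unit, and the model names and scores are all made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BudgetedScore:
    model: str
    benchmark: str
    score: float       # fraction of tasks solved, 0.0 to 1.0
    max_tokens: int    # inference budget per task (hypothetical unit)

def smallest_budget_reaching(results, threshold):
    """Return the smallest budget at which the score reaches the threshold, or None."""
    hits = [r.max_tokens for r in results if r.score >= threshold]
    return min(hits) if hits else None

def leaderboard_at_budget(results, budget):
    """Rank models by their best score among runs that stayed within a fixed budget."""
    best = {}
    for r in results:
        if r.max_tokens <= budget:
            best[r.model] = max(best.get(r.model, 0.0), r.score)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Invented results for two hypothetical models at two budgets.
runs = [
    BudgetedScore("model-a", "demo", 0.40, 1_000),
    BudgetedScore("model-a", "demo", 0.85, 100_000),
    BudgetedScore("model-b", "demo", 0.70, 1_000),
    BudgetedScore("model-b", "demo", 0.72, 100_000),
]
small = leaderboard_at_budget(runs, 1_000)      # model-b ranks first
large = leaderboard_at_budget(runs, 100_000)    # model-a ranks first
```

With these invented numbers the ranking flips between the two budgets, and only one model reaches a 0.8 threshold, and only at the larger budget. A single headline score would hide both facts.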
Why This Changes Everything About Model Comparisons
The practical implications are significant. When Claude Fable 5 was evaluated on the Agents' Last Exam (ALE), a benchmark built from real-world work tasks, it achieved performance similar to GPT-5.5 and Composer 2.5. However, at current pricing, Fable 5 delivers that similar performance while costing roughly 4 to 12 times more per completed task. This kind of comparison is only meaningful if you know the compute budget used for each model.
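One common way to put such comparisons on equal footing is cost per completed task: the cost of one attempt divided by the success rate, assuming failed attempts are simply retried independently. The sketch below uses invented prices and success rates purely to show the arithmetic. They are not ALE results and do not describe any of the models named above.

```python
def cost_per_completed_task(cost_per_attempt, success_rate):
    """Expected spend to get one success when failures are retried independently."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate

# Hypothetical figures: same success rate, very different cost per attempt.
model_x = cost_per_completed_task(cost_per_attempt=2.00, success_rate=0.5)   # 4.0
model_y = cost_per_completed_task(cost_per_attempt=0.25, success_rate=0.5)   # 0.5
ratio = model_x / model_y                                                    # 8.0
```

When success rates are similar, the ratio is driven almost entirely by cost per attempt, which in turn depends on how many tokens each model was allowed to spend. That is why the inference budget has to be reported alongside the score.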
The issue extends beyond pricing. For most tasks, returns to capability follow a sigmoid curve. There is a minimum level of AI capability required to complete a task at all, and below that threshold the AI is of little help. Above a certain point, additional compute yields diminishing returns. The exact shape of that curve matters enormously for safety evaluation, yet it is invisible in current benchmark reporting.
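To make the shape concrete, here is a minimal sketch that models task success as a logistic function of log-compute. The functional form, the use of compute as a stand-in for capability, and the `midpoint` and `steepness` parameters are all modeling assumptions made for illustration. Real curves would have to be measured.

```python
import math

def success_probability(compute, midpoint, steepness=1.0):
    """Logistic curve in log-compute: near 0 well below the midpoint, saturating above it."""
    z = steepness * (math.log(compute) - math.log(midpoint))
    return 1.0 / (1.0 + math.exp(-z))

def marginal_gain(compute, midpoint, factor=2.0, steepness=1.0):
    """Gain in success probability from multiplying compute by `factor`."""
    return (success_probability(compute * factor, midpoint, steepness)
            - success_probability(compute, midpoint, steepness))

MIDPOINT = 1_000   # hypothetical budget at which half the tasks succeed
gain_below = marginal_gain(10, MIDPOINT, steepness=2.0)        # far below the threshold
gain_near = marginal_gain(1_000, MIDPOINT, steepness=2.0)      # at the steep part of the curve
gain_above = marginal_gain(100_000, MIDPOINT, steepness=2.0)   # deep in diminishing returns
```

Doubling compute barely matters far below or far above the midpoint and matters a great deal near it. A single reported score gives no indication of which regime a model was in when it was evaluated, which is the safety-relevant part.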
Experts argue that if you evaluate a model's safety under one compute budget and later release a version with dramatically higher test-time compute, you need to conduct the safety evaluation all over again. The higher-compute version represents a substantial advance from where initial expectations were set, and its safety profile may have changed significantly.
What Happens When You Release a Model With Better Test-Time Reasoning?
The release of Claude Fable 5 highlighted this tension. The model incorporates strong safeguards and represents a Mythos-class capability level, but the full implications of its test-time reasoning abilities require careful analysis of the model card and system documentation. This is precisely the kind of situation where test-time compute accounting becomes critical for understanding what the model can actually do.
The broader pattern is clear: as AI systems become more sophisticated, the gap between their performance under minimal compute and their performance under optimal conditions grows wider. Benchmarks that don't account for this gap are increasingly misleading. They tell you what a model can do under ideal conditions, which is useful information, but they don't tell you what it can do under realistic deployment conditions, or what it might be capable of if someone allocates unlimited compute to a single query.
For the AI research community, the path forward requires transparency. Labs need to publish inference budgets alongside benchmark scores, benchmarks need to track compute usage on leaderboards, and safety frameworks need to explicitly evaluate models at multiple inference budgets. Without these changes, the numbers on a model card will continue to obscure more than they reveal about what these systems can actually do.