Why AI Labs Are Now Spending Hours (Not Milliseconds) on a Single Answer

FrontierNews.ai AI Research Desk

The AI industry is abandoning speed as its primary virtue. For years, the benchmark for artificial intelligence success was how fast a model could generate text. Now, a growing number of AI labs are flipping that logic entirely, deliberately allocating massive amounts of computing power at inference time, the moment when a model processes your question, to enable deeper, more reliable reasoning instead of instant responses.

What Is Test-Time Compute and Why Does It Matter?

Test-time compute, also called inference-time compute, refers to the computational resources an AI model uses while answering your question, rather than during its initial training phase. Traditionally, AI companies optimized for speed, returning answers in milliseconds. But recent breakthroughs show that allocating more compute at inference time, allowing models to reason through problems methodically over extended periods, can dramatically improve answer quality without requiring expensive retraining.

This shift reflects a fundamental change in how enterprises value AI. Rather than asking "How fast can this answer my question?" businesses are now asking "How deeply can this think through my problem?" The practical implications are significant: instead of a chatbot generating a surface-level summary in seconds, an AI system can spend hours researching, cross-referencing sources, testing hypotheses, and refining conclusions.

How Are Companies Implementing Extended Reasoning at Test Time?

Sakana AI, a Tokyo-based startup, recently launched Sakana Marlin, a commercial product that exemplifies this new paradigm. Rather than engaging in back-and-forth prompt engineering sessions with users, Marlin operates as an autonomous research agent that runs continuous reasoning loops for up to eight hours to deliver deeply researched, well-cited 100-page strategy reports and executive slides.

The engine powering Marlin relies on a technique called Adaptive Branching Monte Carlo Tree Search, or AB-MCTS, which Sakana AI first introduced publicly in June 2025 alongside a research paper titled "Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search". The company released the underlying algorithm as open-source software called TreeQuest under the Apache 2.0 license, laying the technical foundation for what eventually became the enterprise-grade Marlin product.

Here's how the system works in practice:

Exploration vs. Exploitation: At each step of the reasoning process, AB-MCTS dynamically decides whether to spawn entirely new hypotheses when the current path yields diminishing returns, or to methodically refine and audit an existing solution that shows promise.
Multi-Model Orchestration: The system can coordinate multiple leading AI models simultaneously, delegating initial ideation to one model while using a reasoning-heavy model to audit and verify intermediate results generated earlier in the search tree.
Bayesian Decision Framework: Rather than blindly running a model dozens of times in parallel and hoping one answer is correct, AB-MCTS uses a principled, multi-turn approach driven by external feedback signals to guide the reasoning process.
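
The wider-or-deeper decision described above can be illustrated with a toy sketch. This is not the TreeQuest API or Sakana AI's implementation. It is a flattened, two-level simplification that uses Thompson sampling over Beta posteriors, and every name in it (`TARGET`, `score`, `generate`, `refine`, `Candidate`, `adaptive_search`) is hypothetical. The "answers" are plain numbers and the external feedback signal is simply the distance to a made-up target.

```python
import random
from dataclasses import dataclass

TARGET = 0.7  # hypothetical "ideal answer" for the toy task


def score(x: float) -> float:
    """External feedback signal in [0, 1]; higher is better."""
    return 1.0 - abs(x - TARGET)


def generate(rng: random.Random) -> float:
    """'Go wider': propose a brand-new candidate."""
    return rng.random()


def refine(x: float, rng: random.Random) -> float:
    """'Go deeper': make a small edit to an existing candidate."""
    return min(1.0, max(0.0, x + rng.gauss(0.0, 0.05)))


@dataclass
class Candidate:
    answer: float
    score: float
    alpha: float  # Beta posterior: evidence that refining this pays off
    beta: float


def adaptive_search(budget: int, rng: random.Random) -> Candidate:
    if budget < 1:
        raise ValueError("budget must be at least 1")
    wider = [1.0, 1.0]  # Beta(1, 1) prior for the "go wider" action
    candidates: list[Candidate] = []
    for _ in range(budget):
        # Thompson sampling: one posterior draw per available action.
        draws = [(rng.betavariate(wider[0], wider[1]), None)]
        draws += [(rng.betavariate(c.alpha, c.beta), c) for c in candidates]
        _, chosen = max(draws, key=lambda d: d[0])
        if chosen is None:
            # Wider: spawn a new hypothesis and update the "wider" posterior.
            x = generate(rng)
            s = score(x)
            candidates.append(Candidate(x, s, 1.0 + s, 2.0 - s))
            wider[0] += s
            wider[1] += 1.0 - s
        else:
            # Deeper: refine the chosen candidate and update its posterior.
            x = refine(chosen.answer, rng)
            s = score(x)
            chosen.alpha += s
            chosen.beta += 1.0 - s
            if s > chosen.score:
                chosen.answer, chosen.score = x, s
    return max(candidates, key=lambda c: c.score)


best = adaptive_search(budget=40, rng=random.Random(0))
print(f"best answer {best.answer:.3f} with score {best.score:.3f}")
```

Each call to `score` stands in for whatever external check a real system would use, such as a test suite or a verifier model. The point of the sketch is only that the budget is spent one step at a time, with each step's feedback shifting the odds between starting fresh and refining.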

This represents a departure from traditional "repeated sampling," where developers would run a model many times in parallel without any mechanism to evaluate intermediate steps or pivot based on feedback.
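
For contrast, repeated sampling (often called best-of-N) can be sketched in a few lines. This is a minimal illustration on a toy task, with hypothetical names (`TARGET`, `score`, `best_of_n`) and a random number standing in for a model call. All samples are drawn independently, and feedback is consulted only once, at the end, to pick a winner.

```python
import random

TARGET = 0.7  # hypothetical "ideal answer" for the toy task


def score(x: float) -> float:
    """Feedback signal in [0, 1]; higher is better."""
    return 1.0 - abs(x - TARGET)


def best_of_n(n: int, rng: random.Random) -> float:
    """Draw n independent samples, then keep the highest-scoring one."""
    samples = [rng.random() for _ in range(n)]  # no sample informs another
    return max(samples, key=score)


pick = best_of_n(40, random.Random(0))
print(f"best-of-40 answer {pick:.3f} with score {score(pick):.3f}")
```

Nothing learned from an early sample changes how a later one is produced, which is the limitation the adaptive approach is meant to address.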

What Real-World Results Are Companies Seeing?

Sakana Marlin's commercial launch demonstrates that extended inference-time reasoning can deliver enterprise-grade results. The system produces structured strategic outputs including executive summary slides, appendices, references, and deeply researched reports on complex topics. Real-world use cases highlighted by the company include generating detailed resolution scenarios for theoretical geopolitical crises, mapping global AI regulation frameworks, and analyzing macroeconomic trends.

The approach also shows promise in narrower, more specialized domains. Researchers have demonstrated that hybrid neuro-symbolic architectures, which combine neural continuous processing with symbolic discrete overrides, can match the performance of fine-tuned models without any supervised fine-tuning. In one study on irony detection in social media text, a framework called Robust Dual-Signal Fusion achieved 78.1% accuracy on a held-out test set of 734 tweets, matching the performance ceiling of fine-tuned specialized models, while using compressed chain-of-thought reasoning without any parameter updates.

The same framework demonstrated robust cross-domain generalization on a separate, heavily imbalanced dataset, filtering out 22.5% of false positives and achieving an Ironic F1 score of 0.4821, outperforming multiple supervised transformer ensembles from academic benchmarks.
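
To see why filtering false positives matters for F1 on an imbalanced dataset, consider a small worked example. The counts below are invented purely for illustration and are not taken from the study. The sketch also assumes the filter removes only false positives and loses no true positives, which the article does not state.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive (here: ironic) class from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, not the study's data.
tp, fp, fn = 40, 80, 20

before = f1_score(tp, fp, fn)

# Assumption: 22.5% of false positives are filtered out, recall unchanged.
fp_filtered = round(fp * (1 - 0.225))
after = f1_score(tp, fp_filtered, fn)

print(f"F1 before filtering: {before:.3f}")
print(f"F1 after filtering:  {after:.3f}")
```

Because the ironic class is rare, false positives dominate the errors, so trimming them raises precision and lifts F1 even when recall stays the same.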

Why Are Enterprises Adopting This Approach?

For enterprises, the shift toward extended inference-time reasoning addresses several critical pain points. First, it can remove the need for expensive supervised fine-tuning, which requires labeled training data and carries the risk of catastrophic forgetting, where a model loses its general-purpose capabilities when adapted to a narrow task. Because inference-time reasoning can recover fine-tuned performance levels, companies can achieve task-specific accuracy while preserving the frozen base model's generalist capabilities.

Second, extended reasoning enables better handling of complex, unstructured problems. Large language models natively default to literal semantic interpretations, making tasks like detecting irony in social media text persistently challenging. By allocating more compute at inference time, systems can apply structured reasoning to overcome these limitations without retraining.

Third, enterprises value data privacy and control. Unlike many consumer-grade AI tools that silently harvest user inputs to train future models, Sakana Marlin operates under strict, enterprise-grade data handling terms, ensuring that proprietary business information remains confidential.

What Does This Mean for the Future of AI Development?

The shift toward test-time compute represents a fundamental reorientation of AI research priorities. For the past two years, the generative AI hype cycle has been defined by speed, with the industry standard being the ability to generate responses in mere milliseconds. But the enterprise frontier is rapidly shifting from shallow, rapid generation to deep, methodical reasoning.

This transition has significant implications for how AI labs allocate resources. Rather than investing primarily in larger models and more training data, companies are now exploring how to make better use of compute at the moment of inference. Sakana AI's decision to release TreeQuest as open-source software suggests that this approach may become a standard tool in the AI developer toolkit, similar to how transformer architectures became foundational to modern language models.

The commercial launch of Sakana Marlin, available immediately via the company's website with pay-as-you-go pricing starting at an enterprise tier, suggests that businesses are willing to pay for deeper reasoning even if it requires waiting hours for results. This willingness to trade speed for quality represents a significant shift in how enterprises evaluate AI tools and may reshape investment priorities across the industry.