FrontierNews.ai

Why AI Agents Excel at Answering Questions But Fail at Asking Them

AI language models are trained overwhelmingly to respond to user queries, not to formulate their own information-seeking strategies. A new study from MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and Harvard's School of Engineering and Applied Sciences (SEAS) has quantified this asymmetry using a collaborative version of the board game Battleship, revealing that today's AI agents are far better at answering well-posed questions than generating strategic questions under uncertainty.

What Is the Question-Asking Gap in AI Agents?

Researchers built a benchmark called "Collaborative Battleship" where two AI agents play together. One agent, the "captain," asks yes-or-no questions and decides where to fire. The other, the "spotter," answers questions based on a hidden board only it can see. This setup isolates two distinct cognitive skills: information-seeking (asking) and information-providing (answering).

The results were striking. When language models serve as spotters, answering questions posed by humans or other agents, they perform at or near human level across all model sizes. When the same models act as captains and must generate their own questions, performance drops sharply.

"Today's language models are primarily optimized to answer complex queries, but it's less clear whether they learn to ask good questions for themselves," said Gabriel Grand, lead author and researcher at MIT CSAIL.

This asymmetry reflects how modern language models are trained: overwhelmingly to respond to user queries, not to formulate information-seeking strategies of their own. The gap matters for autonomous agents working in real-world settings such as scheduling, negotiation, scientific experimentation, and customer service, because those agents must routinely decide what to ask, not merely how to respond.

How Can Inference-Time Reasoning Close the Gap?

The researchers addressed the question-asking deficit by augmenting agents with explicit "world models" and Monte Carlo-style Bayesian inference at decision time. Rather than relying on the language model's implicit reasoning to choose the next question, the system samples possible board states consistent with prior answers, evaluates the expected information gain of candidate questions, and selects the one that maximally reduces uncertainty.
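The selection loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy prior (one straight ship on a 4x4 grid), the hand-written question predicates, and every function name here are assumptions made for illustration. It relies on one standard fact: when a yes-or-no question is answered truthfully, its expected information gain equals the entropy of the predicted answer, so a question that splits the sampled boards closest to 50/50 is the most informative.

```python
import math
import random
from typing import Callable, FrozenSet, List, Tuple

Cell = Tuple[int, int]
Board = FrozenSet[Cell]                          # cells occupied by the ship
Predicate = Callable[[Board], bool]              # a yes/no question about a board
Question = Tuple[str, Predicate]                 # (question text, predicate)


def sample_board(rng: random.Random, size: int = 4, ship_len: int = 2) -> Board:
    """Draw one board from a toy prior: a single straight ship on a size x size grid."""
    if rng.random() < 0.5:                       # horizontal placement
        r, c = rng.randrange(size), rng.randrange(size - ship_len + 1)
        return frozenset((r, c + i) for i in range(ship_len))
    r, c = rng.randrange(size - ship_len + 1), rng.randrange(size)
    return frozenset((r + i, c) for i in range(ship_len))


def sample_consistent(history: List[Tuple[Predicate, bool]], n: int,
                      seed: int = 0, max_tries: int = 100_000) -> List[Board]:
    """Rejection sampling: keep sampled boards that agree with every prior answer."""
    rng = random.Random(seed)
    kept: List[Board] = []
    for _ in range(max_tries):
        board = sample_board(rng)
        if all(fn(board) == answer for fn, answer in history):
            kept.append(board)
            if len(kept) == n:
                break
    return kept


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no outcome with P(yes) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def expected_information_gain(fn: Predicate, boards: List[Board]) -> float:
    """EIG of a truthfully answered yes/no question, estimated from sampled boards."""
    if not boards:
        raise ValueError("need at least one sampled board")
    p_yes = sum(1 for b in boards if fn(b)) / len(boards)
    return binary_entropy(p_yes)


def best_question(questions: List[Question], boards: List[Board]) -> Question:
    """Pick the candidate question with the highest estimated EIG."""
    return max(questions, key=lambda q: expected_information_gain(q[1], boards))


if __name__ == "__main__":
    candidates: List[Question] = [
        ("Is the ship entirely in the top two rows?", lambda b: all(r < 2 for r, _ in b)),
        ("Does the ship touch cell (0, 0)?", lambda b: (0, 0) in b),
    ]
    boards = sample_consistent(history=[], n=500)
    text, _ = best_question(candidates, boards)
    print("Ask:", text)
```

After each answer, the captain would append the (predicate, answer) pair to `history` and resample, so later questions are scored only against boards that remain plausible. Rejection sampling is the simplest choice for a sketch; it becomes inefficient as the history grows and fewer random boards survive.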

The gains were large. Applied to Meta's Llama-4-Scout, a comparatively small open-weight model, the technique lifted its win rate against human opponents from approximately 8% to more than 80%. In head-to-head matches between agents, the inference-augmented Llama-4-Scout reportedly matched or exceeded the performance of GPT-5, OpenAI's frontier model, while consuming roughly 1% of the compute budget.

How to Implement Test-Time Reasoning for Better Agent Performance

  • Deploy Monte Carlo Inference: Use sampling-based Bayesian methods at decision time to evaluate candidate actions and select those maximizing expected information gain, rather than relying solely on the model's implicit reasoning.
  • Build Explicit World Models: Construct representations of possible states consistent with prior observations, allowing the agent to reason about uncertainty and plan strategically before committing to an action.
  • Measure Against Ground Truth: When possible, design tasks with computable optimal solutions so you can quantify exactly how far your agent's reasoning deviates from the Bayesian ideal, enabling precise performance measurement.
  • Prioritize Question Quality: Treat question-generation ability as a first-class evaluation metric alongside answer accuracy, especially for agents that must actively seek information in real-world deployments.
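As a rough illustration of the last two points, the sketch below scores a question against the best available alternative on a board small enough to enumerate exactly. Everything in it is an assumption for illustration, not the study's metric: the 3x3 grid with a single two-cell ship, the two example questions, and the `question_regret` name are hypothetical.

```python
import math
from itertools import product


def entropy_bits(p):
    """Entropy in bits of a yes/no outcome with P(yes) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def all_boards(size=3):
    """Enumerate every placement of one two-cell ship on a size x size grid."""
    boards = []
    for r, c in product(range(size), repeat=2):
        if c + 1 < size:                         # horizontal placement fits
            boards.append(frozenset({(r, c), (r, c + 1)}))
        if r + 1 < size:                         # vertical placement fits
            boards.append(frozenset({(r, c), (r + 1, c)}))
    return boards


def eig(question, boards):
    """Exact expected information gain (bits) under a uniform prior over boards."""
    p_yes = sum(1 for b in boards if question(b)) / len(boards)
    return entropy_bits(p_yes)


def question_regret(asked, candidates, boards):
    """Bits of information lost by asking `asked` instead of the best candidate."""
    best = max(eig(q, boards) for q in [asked, *candidates])
    return best - eig(asked, boards)


if __name__ == "__main__":
    boards = all_boards()
    is_horizontal = lambda b: len({r for r, _ in b}) == 1
    touches_center = lambda b: (1, 1) in b
    regret = question_regret(touches_center, [is_horizontal], boards)
    print(f"regret of asking about the center cell: {regret:.3f} bits")
```

A regret of zero means the agent asked the most informative question in the pool; averaging regret over a game gives a question-quality score that can be tracked alongside answer accuracy. The score is only as good as the candidate pool, so a narrow pool can understate the true gap.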

The compute differential is notable: comparable performance at roughly 1% of the frontier model's budget. It suggests that strategic inference-time computation can substitute for orders-of-magnitude increases in model scale when the task is well-structured. For organizations deploying autonomous agents, the implication is direct: smaller, cheaper models augmented with structured reasoning may outperform expensive frontier systems on well-defined agentic tasks.

What Are the Limitations of This Approach?

Collaborative Battleship is a constrained, rule-based environment with a finite state space. The questions are binary, the board is discrete, and the optimal strategy is computable, conditions that rarely hold in open-ended real-world tasks. The researchers acknowledge that extending these inference-time strategies to domains with continuous state spaces, ambiguous language, or adversarial interlocutors remains an open problem.

Still, the benchmark's precision is its strength. Because the Bayesian optimum is known, the study provides a rare quantitative measure of the gap between what language models actually ask and what they should ask, a metric that vaguer evaluations cannot supply. The research was presented at ICLR 2026, a top-tier machine learning conference, and announced by MIT CSAIL in June 2026.

For the emerging agent economy, the question-asking gap identified by MIT CSAIL and Harvard SEAS is a structural bottleneck. Autonomous agents deployed in commerce, research, and infrastructure must actively seek information, querying APIs, clarifying user intent, and probing uncertain environments, not merely respond to prompts. The finding that inference-time Bayesian strategies can close this gap at 1% of frontier-model compute cost has direct implications for how organizations should design and deploy AI agents in production environments.