FrontierNews.ai

Claude's Accuracy Crisis in Biology Reveals a Bigger Problem With AI Agents

Anthropic's latest research exposes a critical flaw in how AI agents access external data, not in the models themselves. When frontier AI models like Claude Sonnet 4 were tested on identical viral sequence retrieval queries, accuracy plummeted to as low as 16.9% across repeated runs. But the problem wasn't the model's reasoning ability. It was broken data infrastructure. A single deterministic retrieval tool called gget virus pushed accuracy past 92% across all tested models, revealing that many AI agent failures stem from unreliable data access layers, not model limitations.

What Did Anthropic's Biology Research Actually Find?

In June 2026, Anthropic published a research paper titled "Paving the Way for Agents in Biology," which introduced a benchmark called VirBench containing 120 viral sequence retrieval queries spanning 40 different pathogens. Six frontier models were tested without specialized tooling. The results were sobering. Claude Sonnet 4 achieved 16.9% accuracy, while GPT-5.5 reached 91.3% on the same queries. But here's the critical detail: results varied wildly between runs on identical questions, meaning the models couldn't reliably retrieve the same information twice.

Comprehension was not the weak point. The models could reason through complex viral sequences and pathogen data; the breakdown happened at the retrieval step. Biological databases are scattered across multiple systems, their application programming interfaces (APIs) are inconsistent, and results can change between calls. An AI agent could reason correctly and still build its answer on a broken retrieval step, producing unreliable output.

How Did Adding One Tool Change Everything?

Anthropic's team collaborated with the National Center for Biotechnology Information (NCBI) to build gget virus, a deterministic tool that coordinates NCBI's REST, Datasets, and E-utilities APIs, handles large-result batching, and returns standardized logged output. The transformation was dramatic. With gget virus in place, every model in the benchmark crossed 92% accuracy. Claude Sonnet 4 jumped from 16.9% to 92.8%. GPT-5.5 improved from 91.3% to 99.7%. Run-to-run stability reached between 0.92 and 1.00 across the board, meaning results became consistent and reproducible.
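To make the idea of a deterministic retrieval layer concrete, here is a minimal Python sketch. It is not gget virus and does not call any NCBI API. The function names, the `fetch_batch` callable, and the record shape (a dict with an `"id"` key) are all assumptions made for illustration. It shows only the three properties the article attributes to the tool: batching of large requests, standardized output, and a logged result.

```python
import hashlib
import json
import logging
from typing import Callable, Dict, Iterator, List

logger = logging.getLogger("deterministic_retrieval")

# Hypothetical record shape: any JSON-serializable dict with an "id" key.
Record = Dict[str, object]


def batched(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def retrieve_deterministic(
    ids: List[str],
    fetch_batch: Callable[[List[str]], List[Record]],
    batch_size: int = 200,
) -> Dict[str, object]:
    """Fetch records so that the same set of IDs always yields the same output.

    `fetch_batch` is a placeholder for whatever client talks to the real
    data source; it is an assumption, not part of any real library.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # Deduplicate and sort so the request order never depends on the caller.
    unique_ids = sorted(set(ids))

    records: Dict[str, Record] = {}
    for batch in batched(unique_ids, batch_size):
        for record in fetch_batch(batch):
            records[str(record["id"])] = record

    # Standardize the output: fixed record order and fixed key order.
    ordered = [records[key] for key in sorted(records)]
    payload = json.dumps(ordered, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    missing = [i for i in unique_ids if i not in records]

    # Log a fingerprint so two runs can be compared after the fact.
    logger.info(
        "retrieved=%d missing=%d batch_size=%d sha256=%s",
        len(ordered), len(missing), batch_size, digest,
    )
    return {"records": ordered, "missing": missing, "sha256": digest}
```

Under these assumptions, two calls that ask for the same IDs in a different order return byte-identical payloads and the same fingerprint, provided the upstream source itself has not changed. A differing fingerprint then points at the data source rather than at the agent.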

This finding carries profound implications beyond biology. Any developer building AI agent pipelines that reach external APIs, databases, or services faces the same compounding reliability issue. The research draws an explicit conclusion worth highlighting: cheaper models with the right deterministic tool outperformed expensive models without one. Before investing in a larger, more expensive AI model to fix agent reliability, developers should audit whether their data access layer is the actual bottleneck.

Steps to Improve AI Agent Reliability in Your Systems

  • Audit Your Data Access Layer First: Before upgrading to a more expensive or powerful AI model, examine whether your external data retrieval is the bottleneck. Test your agent's consistency across repeated queries to the same data source.
  • Build Deterministic Retrieval Tools: Create standardized, consistent interfaces to your databases and APIs that return logged, reproducible results. This prevents the model from receiving different answers on identical queries.
  • Test Run-to-Run Stability: Measure whether your AI agent produces the same answer when asked the same question multiple times. If results vary, the problem is likely data access, not model reasoning.
  • Prioritize Data Infrastructure Over Model Size: Invest in reliable data pipelines and deterministic tools before reaching for larger or more expensive models. In the VirBench results, a cheaper model with clean data access outperformed a more expensive model with broken retrieval.
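The audit and stability steps above can be sketched in a few lines of Python. The article reports stability scores between 0.92 and 1.00 but does not say how the paper computes them, so the metric below is an assumption: the fraction of repeated runs that agree with the most common answer. The `agent` callable is also hypothetical and stands in for whatever function sends a query to your agent and returns a hashable answer, such as a string.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, List, Sequence


def run_stability(answers: Sequence[Hashable]) -> float:
    """Fraction of runs that agree with the most common answer.

    1.0 means every run returned the same answer. This definition is an
    assumption for illustration, not the metric used in the paper.
    """
    if not answers:
        raise ValueError("need at least one answer")
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


def audit_agent(
    agent: Callable[[str], Hashable],
    queries: List[str],
    runs: int = 5,
) -> Dict[str, float]:
    """Ask each query `runs` times and report per-query stability."""
    report: Dict[str, float] = {}
    for query in queries:
        answers = [agent(query) for _ in range(runs)]
        report[query] = run_stability(answers)
    return report
```

Running the audit twice, once with the agent calling its data source directly and once through a deterministic retrieval layer, gives a rough indication of whether instability comes from data access or from the model.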

What Else Is Anthropic Building in Life Sciences?

The VirBench research is just one piece of a larger strategic push by Anthropic into scientific infrastructure. In February 2026, the company announced flagship partnerships with the Allen Institute and the Howard Hughes Medical Institute (HHMI). In April, Anthropic acquired Coefficient Bio, a stealth drug discovery AI startup that was only eight months old with ten employees, for $400 million. The acquisition brought in operational biotech expertise for drug target selection and clinical regulatory strategy.

Anthropic has opened its own wet labs and is working with Bristol Myers Squibb to deploy Claude across research and development (R&D) and manufacturing operations. The stated goal is a 10-fold compression of life sciences R&D timelines, with a specific focus on making currently "undruggable" targets accessible to pharmaceutical researchers.

Separately, Anthropic released BioMysteryBench, a dataset of 99 real bioinformatics questions written by domain experts across DNA and RNA sequencing, proteomics, and metabolomics. Human experts solved 76 of the 99 questions. Claude Mythos Preview averaged 82.6% accuracy across five trials and solved seven of the 23 questions that no human expert cracked. However, Anthropic was transparent about the limitations: roughly 44% of Mythos's wins on the hardest questions were "brittle," meaning they reproduced in fewer than two of five attempts. The model can reach research-grade answers, but consistency on the hardest problems remains an engineering constraint.

Why Does This Matter Beyond Biology?

The VirBench lesson applies to any AI agent that touches external systems. Whether you're building customer service bots that query customer databases, financial analysis tools that pull market data, or supply chain optimization systems that access inventory APIs, the same principle holds: unreliable data access will break your agent's reliability, regardless of how powerful the underlying model is. The research demonstrates that infrastructure matters as much as model capability when deploying AI agents in production environments.

On June 30 at 10 a.m. Pacific Time, Anthropic is hosting "The Briefing: AI for Science," a live-streamed event for pharma executives, lab directors, and biotech founders. Given the timing, with John Jumper's hire announced on June 19 and the event following less than two weeks later, this is likely where Anthropic reveals its next move in life sciences tooling. For anyone building in health, biology, or scientific data pipelines, the event is worth attending. And for anyone building AI agents in any domain, the VirBench research is the paper to read this month.
