Claude Just Solved Bioinformatics Problems That Stumped Human Experts

Anthropic has released a new benchmark called BioMysteryBench that measures how well AI models like Claude can tackle real-world bioinformatics challenges, and the results suggest AI is beginning to outpace human experts on some of the field's hardest problems. The benchmark consists of 99 complex bioinformatics problems drawn from actual research scenarios. When tested against this benchmark, Claude Mythos Preview achieved an average accuracy of 82.6% on problems that humans could solve, while also solving approximately 30% of problems that human experts could not solve on their own.

Why Is This Benchmark Different From Other AI Tests?

Unlike traditional AI benchmarks that focus on standardized exams or coding challenges, BioMysteryBench was designed to capture the messy, creative reality of scientific research. Anthropic noted that while there are established benchmarks for software engineering (like SWE-bench), no comparable standard existed for evaluating AI in scientific fields. This gap exists because biology presents unique challenges that make traditional benchmarking difficult.

The benchmark accounts for several real-world complexities in biological research:

  • Multiple Valid Approaches: Unlike mathematics, biology often has several correct methods to reach the same conclusion, making it hard to judge a single "right answer."
  • Subjective Expert Judgment: Individual researchers bring their own interpretations and experience to problems, making standardized evaluation challenging.
  • Noisy Data: Biological datasets can be incomplete or contain errors that lead to entirely different conclusions depending on how they're analyzed.

BioMysteryBench solves these problems by evaluating AI models based on their final biological conclusions rather than the path they took to reach them. This approach allows researchers to test whether Claude's conclusions align with those of human scientists and whether the model can devise creative solutions that humans might not consider.
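Outcome-based grading of this kind can be sketched in a few lines. This is an illustrative sketch only; Anthropic has not published its scoring code, and the function names and matching logic here are hypothetical assumptions:

```python
# Hypothetical sketch of outcome-based scoring: only the final
# biological conclusion is compared, not the analysis pipeline
# used to reach it.

def normalize(conclusion: str) -> str:
    """Reduce superficial formatting differences before comparison."""
    return " ".join(conclusion.lower().split())

def score_outcome(model_conclusion: str, accepted_conclusions: list[str]) -> bool:
    """A problem counts as solved if the model's final conclusion
    matches any expert-accepted conclusion, regardless of method."""
    accepted = {normalize(c) for c in accepted_conclusions}
    return normalize(model_conclusion) in accepted

# Two different analysis paths can yield the same accepted conclusion.
accepted = ["The variant disrupts a splice donor site"]
print(score_outcome("the variant disrupts a  splice donor site", accepted))  # True
```

In practice, real graders would need fuzzier matching than exact string comparison (biological conclusions rarely align word-for-word), but the design point is the same: the method is free to vary as long as the conclusion agrees.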

How Did Claude Models Perform on These Problems?

Anthropic had up to five expert bioinformaticians attempt each of the 99 problems. Out of this set, humans successfully solved 76 problems. When multiple Claude models tackled the same problems, accuracy varied by model version, but every recent version performed strongly on the problems humans could solve.

  • Claude Mythos Preview: Achieved 82.6% average accuracy across five trials on human-solvable problems.
  • Claude Sonnet 4.6: Exceeded 70% accuracy on the same set of problems.
  • Claude Opus 4.6 and 4.7: Both achieved accuracy rates exceeding 70% on human-solvable problems.

The 23 problems that humans could not solve presented a different challenge. When Claude models attempted them, Claude Mythos Preview reached a maximum accuracy of 30% across five trials. This suggests that while AI can sometimes crack problems beyond human reach, that advantage is far from guaranteed.
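The arithmetic behind these headline figures is straightforward. In the sketch below, the totals (99 problems, 76 human-solved) come from the article, while the per-trial accuracies are illustrative placeholders, not published data:

```python
# Arithmetic behind the headline figures. Totals are from the article;
# the per-trial accuracies are illustrative placeholders.

TOTAL_PROBLEMS = 99
HUMAN_SOLVED = 76
HUMAN_UNSOLVED = TOTAL_PROBLEMS - HUMAN_SOLVED  # 23

human_solve_rate = HUMAN_SOLVED / TOTAL_PROBLEMS
print(f"Human solve rate: {human_solve_rate:.1%}")  # 76.8%

# "Average accuracy across five trials" is a mean over repeated runs.
illustrative_trials = [0.84, 0.81, 0.83, 0.82, 0.83]  # placeholder values
avg = sum(illustrative_trials) / len(illustrative_trials)
print(f"Average accuracy: {avg:.1%}")  # 82.6%
```

One detail worth noting: the 82.6% average is computed only over the 76 human-solvable problems, so it is not directly comparable to the 76.8% human solve rate over all 99.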

What Makes Claude's Approach Different From Human Problem-Solving?

One of the most striking findings from BioMysteryBench is that Claude doesn't always solve problems the same way humans do. In some cases, Claude mimicked human strategies, using algorithms and databases to identify patterns. In other instances, the model took entirely different approaches that humans might never consider.

"Claude was able to intuitively recognize specific patterns and sequences where human experts used algorithms and databases to identify and annotate the characteristics of a dataset,"

Anthropic researchers, in BioMysteryBench analysis

This "intuition" represents a significant capability gap. Anthropic speculated that large-scale language models have the potential to discover patterns on an unprecedented scale, something that traditional biological machine learning models have struggled to achieve. However, this advantage comes with a tradeoff. Analysis of Claude Opus 4.6 revealed that when the model was unsure of an answer, even for simpler problems, it often tried multiple different approaches and sometimes made mistakes by choosing an answer where multiple methods converged.

Steps to Understanding Claude's Bioinformatics Capabilities

  • Recognize the Benchmark's Scope: BioMysteryBench evaluates Claude on 99 real-world bioinformatics problems, not theoretical exercises, making the results directly applicable to actual research scenarios.
  • Compare Performance Across Model Versions: Different Claude versions (Sonnet, Opus, Mythos) show varying accuracy levels, so researchers should select the model that best fits their specific bioinformatics needs.
  • Understand the Limitations: For problems that neither humans nor AI have solved, it remains unclear whether they're impossible or simply extremely difficult, meaning benchmark results don't guarantee AI will solve future unseen problems.

What Does This Mean for Scientific Research?

The implications of BioMysteryBench extend beyond a single benchmark score. The fact that Claude can solve problems humans cannot suggests that AI is beginning to function as a genuine research partner rather than just a tool for automating routine tasks. This is particularly significant in fields like bioinformatics, where the ability to recognize patterns in complex datasets can accelerate discovery.

However, Anthropic acknowledged that BioMysteryBench has limitations. For tasks that neither humans nor AI have successfully solved, researchers cannot be certain whether the problems are genuinely unsolvable or simply extremely difficult. This uncertainty means the benchmark provides a snapshot of current capabilities rather than a definitive measure of AI's scientific potential.

Looking forward, Anthropic stated that it plans to build more long-term, real-world scientific tasks to further push Claude's research capabilities. The organization also welcomed creative ideas from the research community for expanding the benchmark. As Claude models continue to improve with each generation, the gap between human and AI performance on scientific problems is likely to narrow further, potentially reshaping how bioinformatics research is conducted.

" }