FrontierNews.ai

The Hidden Bottleneck in AI Antibody Design: Why Better Data Matters More Than Better Models

The field has invested heavily in building better AI models for antibody discovery, but the structural data those models depend on is falling dangerously behind. Protein-folding AI like AlphaFold has reset expectations across structural biology, yet one practical problem is becoming increasingly difficult to ignore: the supply of antibody-antigen structures is not keeping up with the demands being placed on it.

Why Is Antibody Training Data So Limited?

The numbers tell a stark story. There are more than 240,000 protein structures in the Protein Data Bank, but only approximately 1,800 of them, under 1 percent, are antibody-antigen pairs. This imbalance means that AI models perform well at predicting how individual proteins fold, but are often wrong when predicting the structure of those same proteins bound to an antibody.
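To put the imbalance in proportion, the two figures quoted above can be worked through directly. This is a rough estimate only, since both inputs are the article's rounded numbers rather than exact database counts.

```python
# Figures quoted in the article; both are approximate.
total_pdb_structures = 240_000    # "more than 240,000" structures in the PDB
antibody_antigen_pairs = 1_800    # "approximately 1,800" antibody-antigen pairs

# Share of the database made up of antibody-antigen complexes.
share = antibody_antigen_pairs / total_pdb_structures

# How many structures there are for each antibody-antigen complex.
ratio = total_pdb_structures / antibody_antigen_pairs

print(f"Antibody-antigen share of the PDB: {share:.2%}")
print(f"About 1 antibody-antigen complex per {ratio:.0f} structures")
```

On these figures, antibody-antigen complexes make up about three-quarters of one percent of the available structures.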

The scarcity is compounded by how those structures were generated in the first place. Most were produced to answer specific biological or therapeutic questions, which means they cluster around targets that were experimentally tractable, structurally stable, and scientifically compelling enough for a research group to invest in solving. While this selectivity is understandable, it produces a training dataset with significant blind spots.

"If the training set is biased toward relatively well-behaved proteins, common interaction geometries, or stabilized experimental systems, the model can appear strong in benchmark settings but become less reliable when applied to new targets or more complex discovery problems," said Dan Benjamin, co-founder and chief technology officer of Immuto Scientific.

The consequence is a dataset that is scientifically useful but deeply uneven. Well-behaved proteins, common interaction geometries, and stabilized experimental systems are overrepresented. Difficult targets, conformationally flexible antigens, non-binding antibodies, and non-functional binders are largely absent. The field often refers to this as the missing "negative" data problem: the training set shows models what productive interactions look like but gives them limited exposure to the cases where binding fails, and why.

What Happens When AI Models Encounter New Targets?

This data limitation creates a real-world problem that is difficult to detect computationally and expensive to discover experimentally. A model that performs well on benchmarks constructed from public data may generate plausible-looking structures for new targets while being systematically wrong about the actual binding mode. In practice, this means a discovery team may spend time optimizing around an incorrect structural hypothesis, wasting resources and delaying progress toward viable drug candidates.

The failure modes that emerge from structural data limitations are distributed unevenly across target classes. Conformational epitopes, where the site the antibody binds is defined by the three-dimensional shape of a folded protein surface rather than by a contiguous stretch of sequence, are among the most difficult. Models trained primarily on linear epitope structures may not have enough relevant examples to generalize to these more complex binding scenarios.

How Can Better Data Fix This Problem?

Capturing structural diversity in antibody-antigen interaction data requires a genuinely broad range of antigens, epitopes, paratope geometries, complementarity-determining region loop conformations, scaffold types, binding orientations, affinities, and dynamic interaction states. The dynamic dimension is particularly important and particularly underrepresented in current databases.

Conventional structural biology methods like X-ray crystallography and cryo-electron microscopy provide extraordinary resolution but require conditions that can move proteins away from their biological context. Crystallization demands a stable, homogeneous structure, and cryo-EM samples are flash-frozen. The result is a snapshot of a complex that may have been stabilized, engineered, or concentrated in ways that alter what is being observed. For targets that are flexible, membrane-associated, or conformationally heterogeneous, those conditions can systematically exclude the interaction states most relevant to therapeutic function.

"When structural diversity is limited, models can still perform well within familiar territory. The problem emerges when they are asked to generalize," explained Benjamin.

The good news is that even modest amounts of experimentally anchored interaction data can help. Research shows that as few as 20 antibody-antigen pairs can sometimes be enough to help distinguish which predicted structure is biologically plausible, allowing models to rank the correct answer higher among the candidates they generate.
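The article does not describe how experimental data is used to re-rank predictions, so the following is only an illustrative sketch of the general idea, not Immuto Scientific's method. It assumes each candidate structure comes with a model confidence score and a set of antigen residues it places in contact with the antibody, and that an experiment has identified some contact residues. The class name, the scoring blend, the `weight` parameter, and all data values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CandidatePose:
    name: str
    model_score: float           # hypothetical model confidence in [0, 1]
    epitope_residues: frozenset  # antigen residues this pose puts in contact


def experimental_agreement(pose, observed_contacts):
    """Jaccard overlap between predicted and experimentally observed contacts."""
    union = pose.epitope_residues | observed_contacts
    if not union:
        return 0.0
    return len(pose.epitope_residues & observed_contacts) / len(union)


def rerank(poses, observed_contacts, weight=0.5):
    """Sort poses by a blend of model score and experimental agreement.

    `weight` is an arbitrary illustrative choice, not a recommended value.
    """
    def combined(pose):
        agreement = experimental_agreement(pose, observed_contacts)
        return (1 - weight) * pose.model_score + weight * agreement

    return sorted(poses, key=combined, reverse=True)


# Invented toy data: the model prefers pose_A, the experiment supports pose_B.
poses = [
    CandidatePose("pose_A", 0.90, frozenset({10, 11, 12, 13})),
    CandidatePose("pose_B", 0.80, frozenset({45, 46, 47, 48})),
]
observed = frozenset({45, 46, 47, 50})

ranked = rerank(poses, observed)
print([pose.name for pose in ranked])  # ['pose_B', 'pose_A']
```

In this toy case the model alone would rank pose_A first, but the experimentally observed contacts overlap only with pose_B, so the blended score moves pose_B to the top.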

Steps to Improve AI Antibody Discovery Through Better Data

  • Expand Structural Diversity: Systematically generate antibody-antigen structures across a wider range of targets, epitope types, and binding modes rather than focusing only on scientifically tractable or commercially promising targets.
  • Capture Dynamic States: Develop methods to capture the dynamic, flexible interaction states that are most relevant to therapeutic function, not just the stable snapshots that conventional structural biology methods produce.
  • Include Negative Data: Deliberately generate and include data on non-binding antibodies and failed interactions, giving models exposure to what unsuccessful binding looks like and why it fails.
  • Anchor Models to Experimental Reality: Validate computational predictions with even small amounts of experimental data early in the discovery process to catch systematic errors before they derail entire programs.
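A first practical step toward the items above is simply to measure how uneven an existing dataset is. The sketch below is a minimal, assumed example: the record fields (`epitope_type`, `outcome`), the category names, and the 25 percent threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def underrepresented(records, field, expected, min_share=0.25):
    """Return expected categories of `field` whose share is below min_share.

    Categories that never appear in the data are reported with a share of 0.0.
    """
    if not records:
        return {category: 0.0 for category in expected}
    counts = Counter(record[field] for record in records)
    total = len(records)
    shares = {category: counts.get(category, 0) / total for category in expected}
    return {category: s for category, s in shares.items() if s < min_share}


# Invented toy dataset of 20 records, skewed toward linear epitopes and binders.
records = (
    [{"epitope_type": "linear", "outcome": "binder"}] * 17
    + [{"epitope_type": "conformational", "outcome": "binder"}] * 2
    + [{"epitope_type": "linear", "outcome": "non_binder"}] * 1
)

epitope_gaps = underrepresented(
    records, "epitope_type", expected=("linear", "conformational")
)
outcome_gaps = underrepresented(
    records, "outcome", expected=("binder", "non_binder")
)

print(epitope_gaps)  # conformational epitopes are 2 of 20 records
print(outcome_gaps)  # non-binders are 1 of 20 records
```

Passing the expected categories explicitly matters here: a category that is missing entirely, such as negative data, would otherwise never show up in the counts and so would never be flagged.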

The field took time to recognize this data bottleneck because the first wave of progress in protein AI was so impressive. It was natural to focus on architectures, scale, and compute power. But as Benjamin noted, the field is now reaching the point where the data layer is becoming a more visible constraint on what AI can reliably accomplish in real discovery settings.

Meanwhile, other organizations are tackling the broader challenge of connecting AI-driven workflows across the entire drug discovery pipeline. Insilico Medicine, a clinical-stage biotechnology company, has built an integrated platform called Pharma.AI that combines Biology42 for target discovery, Chemistry42 for molecular design, and Medicine42 for clinical insight. The company has rapidly progressed 30 developmental candidates, with 13 advancing to the clinical stage, including multiple Phase I and Phase II trials.

Looking ahead, Insilico aims to nominate 40 to 50 preclinical candidates and, within the next two to three years, complete the industry's first Phase III clinical trial for a therapeutic discovered using AI. The company's founder and CEO, Alex Zhavoronkov, was recently recognized in the inaugural SCW75 list by Scientific Computing World for pioneering AI-driven drug discovery and longevity research.

The broader lesson is clear: as AI models become more sophisticated, the quality of the data infrastructure behind them becomes the defining constraint on what they can reliably accomplish. Whether a model works in the real world of drug discovery is decided not at the modeling stage but by the data that supports it. That makes the data factory central to the science, not incidental to it.