The Hidden Flaw in AI Drug Discovery: Why Better Data Matters More Than Better Algorithms
Artificial intelligence can only be as good as the data feeding it, and right now, most genomic databases are built on incomplete, messy information that leads researchers astray. That's the sobering message from Gonçalo Abecasis, senior vice president and chief genomics and data science officer at Regeneron Genetics Center, who warns that the promise of AI-driven drug discovery hinges on solving a problem that has nothing to do with algorithms.
Why Do We Need Tens of Millions of Genomes, Not Just Thousands?
The scale of genetic data required to identify viable drug targets is staggering. Regeneron's team has sequenced the genomes of about 3 million people with complete health records, yet they can only confidently assign functions to roughly 2,000 of the 20,000 genes humans possess. To unlock the remaining genes as potential therapeutic targets, researchers estimate they'll need data from tens of millions of people.
This isn't just about quantity. The quality of that data determines whether AI systems learn real biology or statistical artifacts. Electronic health records, for example, are designed primarily for billing purposes, not scientific accuracy. A hospital might code a diagnosis differently depending on insurance coverage, creating noise that AI models can mistake for genuine biological signals.
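One common defense against this kind of billing-driven noise is to collapse raw billing codes into clinical phenotypes before any modeling, so a model sees "type 2 diabetes" rather than whichever code a hospital happened to submit. A minimal sketch of that idea (the ICD-10 codes below are real, but the mapping strategy is a simplified illustration, not any group's actual pipeline):

```python
# Sketch: group billing codes into clinical phenotypes before analysis,
# so two hospitals that code the same condition differently produce
# the same label. Mapping is illustrative, not exhaustive.

ICD10_TO_PHENOTYPE = {
    "E11.9": "type_2_diabetes",   # T2D without complications
    "E11.65": "type_2_diabetes",  # T2D with hyperglycemia
    "I10": "hypertension",        # essential (primary) hypertension
}

def phenotypes(billing_codes):
    """Collapse a patient's billing codes into phenotype labels,
    ignoring codes the mapping does not cover."""
    return {ICD10_TO_PHENOTYPE[c] for c in billing_codes if c in ICD10_TO_PHENOTYPE}

# Patients coded E11.9 and E11.65 both resolve to the same phenotype.
print(phenotypes(["E11.9"]))
print(phenotypes(["E11.65", "I10"]))
```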
"AI is not going to get you very far if you say, 'Here's two million people with data. What's interesting here? What should be a next drug?' But if you ask specific questions, they can really be huge time-savers," said Gonçalo Abecasis.
Gonçalo Abecasis, Senior Vice President and Chief Genomics and Data Science Officer, Regeneron Genetics Center
How Are Researchers Actually Using Genomic Data to Find Drug Targets?
The most successful AI-driven discoveries start with a clear biological hypothesis, not a fishing expedition through raw data. Regeneron's work on GPR75 for obesity and diabetes illustrates this approach. Researchers observed that roughly 1 in 5,000 people naturally lack a functional copy of the GPR75 gene. These individuals are consistently lighter in body weight and show roughly 50% lower rates of diabetes compared to the general population. That consistent pattern across hundreds of different genetic variations in the same gene provided strong evidence that blocking GPR75 would be safe and effective.
Similarly, the team identified CIDEB as a promising target for liver disease using the same logic: natural human experiments showing that people missing functional copies of the gene have better liver health outcomes. Once researchers have that kind of specific knowledge, they can then decide which therapeutic approach makes sense, whether that's an antibody, gene editing, or RNA interference.
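The aggregation logic behind these natural experiments can be sketched as a simple gene-burden comparison: pool everyone carrying a loss-of-function variant in the gene and compare their outcome rate against non-carriers. The snippet below uses a tiny synthetic cohort purely for illustration; real analyses run on millions of exomes and adjust for ancestry, age, sex, and other covariates.

```python
# Minimal sketch of a gene-burden "natural experiment" comparison.
# All data is synthetic; no real cohort or pipeline is represented.

def burden_comparison(cohort, gene):
    """Compare a disease rate between loss-of-function (LoF) carriers
    and non-carriers for one gene."""
    carriers = [p for p in cohort if gene in p["lof_genes"]]
    others = [p for p in cohort if gene not in p["lof_genes"]]
    rate = lambda group: sum(p["has_diabetes"] for p in group) / len(group)
    return rate(carriers), rate(others)

# Toy cohort: 0 of 2 LoF carriers have diabetes, 2 of 4 non-carriers do.
cohort = [
    {"lof_genes": {"GPR75"}, "has_diabetes": False},
    {"lof_genes": {"GPR75"}, "has_diabetes": False},
    {"lof_genes": set(), "has_diabetes": True},
    {"lof_genes": set(), "has_diabetes": True},
    {"lof_genes": set(), "has_diabetes": False},
    {"lof_genes": set(), "has_diabetes": False},
]
carrier_rate, other_rate = burden_comparison(cohort, "GPR75")
```

Seeing the same lower rate across many independent variants in the same gene is what distinguishes a real biological effect from a one-off statistical fluke.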
Steps to Improve Genomic Data Quality for AI Analysis
- Audit for Hidden Biases: Spend time understanding blind spots in your datasets, such as how electronic health records conflate clinical reality with billing codes. Researchers must actively identify and document these quirks before training AI models.
- Validate Against Natural Experiments: Look for consistent patterns across multiple independent genetic variations in the same gene. If hundreds of different mutations in one gene all produce the same health outcome, that's a strong signal for a real biological effect.
- Combine Multiple Data Types: Pair genomic sequences with epigenomic data, transcriptomics, and clinical outcomes. The Vanderbilt NeuroCline team, for example, is combining whole-genome sequences from 9,000 ALS patients with epigenomics, transcriptomics, and proteomics data from over 2,000 cases to identify motor neuron disease targets.
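The third step above hinges on being able to line up different data modalities for the same patients. A minimal sketch of that join, assuming dictionary-shaped tables keyed by patient ID (the field names are illustrative, not from any consortium's schema):

```python
# Sketch of combining data modalities by patient ID, keeping only
# patients present in every modality (an inner join). Field names
# are hypothetical placeholders.

genomic = {"P1": {"variant": "GPR75_LoF"}, "P2": {"variant": None}}
transcriptomic = {"P1": {"GPR75_expr": 0.1}, "P2": {"GPR75_expr": 1.0}}
clinical = {"P1": {"bmi": 24.1}, "P2": {"bmi": 31.7}, "P3": {"bmi": 27.0}}

def merge_by_patient(*tables):
    """Inner-join several {patient_id: record} tables; patients missing
    from any modality (like P3 above) are dropped."""
    shared = set.intersection(*(set(t) for t in tables))
    return {pid: {k: v for t in tables for k, v in t[pid].items()}
            for pid in sorted(shared)}

merged = merge_by_patient(genomic, transcriptomic, clinical)
```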
What New Discoveries Are Emerging From Large-Scale Genomic Studies?
Beyond drug discovery, massive genomic studies are rewriting our understanding of human ancestry and disease risk. Researchers at RIKEN's Center for Integrative Medical Sciences analyzed whole-genome sequences from more than 3,200 people across Japan and discovered evidence for a previously overlooked third ancestral group linked to the ancient Emishi people of northeastern Japan. This finding challenges the long-accepted "dual origins" theory that had dominated the field for decades.
The study also uncovered something medically significant: inherited DNA from Neanderthals and Denisovans that still affects modern Japanese populations. Researchers identified 44 archaic DNA regions in modern Japanese genomes, including a Denisovan-derived segment in the NKX6-1 gene associated with type 2 diabetes that may influence how some patients respond to semaglutide treatments. They also found 11 Neanderthal-derived genetic segments connected to coronary artery disease, prostate cancer, and rheumatoid arthritis.
"The Japanese population isn't as genetically homogenous as everyone thinks. Our analysis revealed Japan's subpopulation structure on a fine scale, which is very beautifully classified according to geographical locations in the country," explained Chikashi Terao.
Chikashi Terao, Lead Researcher, RIKEN Center for Integrative Medical Sciences
How Is Technology Turning Raw Genetic Data Into Clinical Insights?
The journey from a DNA sample to a medical report involves multiple layers of technology working in concert. Modern sequencing platforms read nearly all 3 billion DNA base pairs in a person's genome, generating massive datasets in hours. But the real challenge begins after sequencing: interpreting what those billions of data points actually mean.
Bioinformatics platforms analyze raw genetic data to detect variations, typically running on cloud computing infrastructure that allows researchers to store and process enormous datasets. These platforms compare identified variants against reference databases like ClinVar and gnomAD, which have been built over years of research and help clarify whether a variant is harmless, clinically relevant, or still uncertain.
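At its core, that comparison step is a lookup: each called variant is checked against curated entries, and anything uncatalogued is flagged as a variant of uncertain significance (VUS). A hedged sketch of the idea, where the in-memory dictionary stands in for a ClinVar-style database and the `classify` helper is illustrative rather than any real client API:

```python
# Sketch of checking called variants against a curated reference.
# The dictionary stands in for a database like ClinVar; entries and
# the helper function are illustrative only.

KNOWN_VARIANTS = {
    # (gene, variant) -> catalogued clinical significance
    ("BRCA1", "c.68_69del"): "pathogenic",
}

def classify(gene, variant):
    """Return the catalogued significance, or mark the variant as a
    variant of uncertain significance (VUS) if it isn't listed."""
    return KNOWN_VARIANTS.get((gene, variant), "uncertain_significance")
```

Because the reference dictionary can grow over time, re-running `classify` over stored variants is also how a yesterday's VUS can become actionable tomorrow.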
Artificial intelligence is increasingly playing a defined role in this interpretation process. AI and machine-learning tools identify patterns in large datasets faster than humans could manually, and they can surface mutations with potential medical relevance by sifting through vast amounts of scientific literature. This proves especially crucial for complicated conditions like rare disorders or cancers, where identifying the relevant genetic alterations early can be critical.
One important insight: genetic interpretation isn't static. As new data is added to global databases, understanding of diseases and their genetic links evolves. A finding that appears unclear today may become actionable in the future once new supporting evidence emerges. This means genetic reports can be revisited and reinterpreted over time as technology improves.
What Major Funding Is Driving AI-Powered Genomics Research?
Recognition of genomics' potential is translating into significant investment. Vanderbilt NeuroCline, a consortium of researchers from Vanderbilt Health and Vanderbilt University, recently received a prestigious award to explore using artificial intelligence to find new drug targets for amyotrophic lateral sclerosis (ALS), the most common form of motor neuron disease. The Longitude Prize on ALS is a £7.5 million global challenge prize that rewards cutting-edge AI-based approaches to drug discovery.
Vanderbilt NeuroCline was among 20 teams selected to receive "Discovery Awards" of £100,000 each, based on their potential to use AI to identify and validate drug targets. The team, led by Veronique Belzil, director of the Vanderbilt ALS Research Center, and Bennett Landman, director of the Vanderbilt Lab for Immersive AI Translation (VALIANT), is structured around milestones that progressively transform raw whole-genome, epigenomic, and transcriptomic data into biologically interpretable, therapeutically actionable ALS targets.
The consortium now has access to the largest and most comprehensive ALS patient dataset of its kind, combining multiple types of biological information that have never before been available in one place. This includes genomic sequences from 9,000 ALS patients and epigenomics, transcriptomics, and proteomics data for over 2,000 cases. Next year, 10 teams will progress to a second stage, receiving an additional £200,000 to build evidence for their proposed therapeutic targets. In 2028, five teams will receive £500,000 to undertake validation in the laboratory.
The broader shift in genomics research reflects a recognition that most large genomic databases have historically focused on people of European ancestry, limiting scientists' understanding of disease risk in other populations. Expanding databases like Japan's JEWEL (Japanese Encyclopedia of Whole-Genome/Exome Sequencing Library) and India's GenomeIndia Project, which aims to sequence up to one million genomes in the coming years, will create more diverse genomic resources that benefit global healthcare.