AI Just Cracked How DNA Controls Gene Splicing from Thousands of Bases Away
A new artificial intelligence system can now predict how distant DNA sequences influence gene splicing with unprecedented accuracy, potentially transforming how scientists understand genetic diseases and design new treatments. Researchers at the University of Tokyo created SpliceSelectNet (SSNet), a deep learning model that analyzes DNA sequences spanning up to 100,000 base pairs while maintaining single-nucleotide resolution, enabling it to detect regulatory signals located thousands of bases away from the splice sites they control.
Why Does Long-Range DNA Analysis Matter for Human Health?
RNA splicing is the process by which cells remove non-coding segments (introns) from a newly transcribed RNA and join the remaining segments (exons) into a mature messenger RNA that can be translated into protein. Errors in this process contribute to genetic diseases, cancer, and other serious conditions. However, predicting how DNA variants affect splicing has remained a major challenge because many regulatory signals sit far from the splice sites they influence. Existing artificial intelligence models struggle with this distance problem, limiting scientists' ability to understand disease-causing mutations and develop targeted treatments.
The breakthrough matters because most current computational tools rely on convolutional neural networks, which can only capture regulatory signals within a limited range of the splice site. This means they miss critical information encoded thousands of base pairs away. SSNet solves this by using a hierarchical Transformer architecture, a type of deep learning model originally developed for language processing, but redesigned specifically for DNA's unique properties.
How Does SSNet Analyze Such Long DNA Sequences Efficiently?
The key innovation lies in how SSNet handles the computational challenge of processing extremely long sequences. Instead of analyzing the entire 100,000-base-pair stretch at once, which would be computationally prohibitive, the model divides long DNA sequences into smaller blocks. It then analyzes local patterns within each block and integrates information across the entire sequence through a hierarchical attention process. This two-level approach preserves the dense attention needed for accuracy while remaining computationally efficient.
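The two-level idea can be illustrated with a toy sketch. This is not SSNet's actual architecture: the learned projections are omitted, and the mean-pooled block summaries, the function names, and the block size of 1,000 are all assumptions made for illustration. The sketch only shows why attending within blocks and then across block summaries costs far less than attending over every pair of positions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Dot-product self-attention without learned weights (illustrative only).
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores) @ x

def hierarchical_attention(x, block_size):
    # x: (sequence_length, features). Hypothetical two-level scheme.
    n, d = x.shape
    assert n % block_size == 0, "sketch assumes length divisible by block size"
    blocks = x.reshape(n // block_size, block_size, d)
    local = self_attention(blocks)           # level 1: dense attention inside each block
    summaries = local.mean(axis=1)           # assumption: one mean-pooled vector per block
    global_ctx = self_attention(summaries)   # level 2: attention across block summaries
    out = local + global_ctx[:, None, :]     # give every position its block's global context
    return out.reshape(n, d)                 # one output per base: resolution is preserved

def attention_pairs(n, block_size):
    # Number of position pairs scored by full vs. two-level attention.
    num_blocks = n // block_size
    return n * n, num_blocks * block_size**2 + num_blocks**2

rng = np.random.default_rng(0)
x = rng.normal(size=(1_000, 8))              # toy embedding of a 1,000-base sequence
out = hierarchical_attention(x, block_size=100)

full, hier = attention_pairs(100_000, 1_000)
print(full, hier)                            # 10000000000 100010000
```

Under these assumed settings, the two-level scheme scores roughly 100 times fewer pairs than full attention over 100,000 positions, while still returning one output vector per base.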
The researchers also enabled visualization of attention scores, allowing them to identify which DNA regions the model considered important during prediction. This transparency is crucial because it helps bridge the gap between predictive accuracy and biological interpretability, showing that the regions highlighted by the model closely correspond to biologically meaningful regulatory elements.
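As a rough illustration of how attention scores can be inspected, the sketch below ranks positions by the attention weight they receive from a single query position and reports how far each lies from it. The helper name, the synthetic weights, and the positions are hypothetical; real scores would come from the trained model, whose output format is not described here.

```python
import numpy as np

def top_attended_positions(attn_row, query_pos, k=3):
    # Return (position, signed distance from query, weight) for the k
    # most strongly attended positions. Hypothetical helper.
    order = np.argsort(attn_row)[::-1][:k]
    return [(int(p), int(p) - query_pos, float(attn_row[p])) for p in order]

# Synthetic attention weights over a 10,000-base window (made-up values).
attn = np.full(10_000, 1e-5)
attn[[5_000, 5_002, 120]] = [0.4, 0.3, 0.2]
attn = attn / attn.sum()                     # normalize so the weights sum to 1

hits = top_attended_positions(attn, query_pos=5_001)
for pos, dist, weight in hits:
    print(pos, dist, round(weight, 3))
```

In this made-up example, two of the top positions flank the query site and the third sits 4,881 bases upstream, the kind of distant signal a short-range model would not see.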
What Are SSNet's Practical Applications?
- Variant Interpretation: SSNet can screen variants in non-coding regions that currently have uncertain significance, helping clinicians and researchers determine whether genetic mutations are likely to cause disease by affecting RNA splicing patterns.
- Disease Research: The model maintains sensitivity to regulatory signals located many thousands of base pairs from affected splice sites, enabling researchers to understand how distant mutations contribute to genetic diseases and cancer through aberrant splicing.
- Drug Development: In pharmaceutical research, the approach could assist in designing oligonucleotide therapeutics that target abnormal splicing, offering a new strategy for treating splicing-related disorders.
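One common way to screen a variant with a splice predictor is to score the reference and the mutated sequence and compare the two. The sketch below assumes that workflow; the article does not describe SSNet's interface, so `toy_donor_scores` is a deliberately crude stand-in (it simply marks every "GT" dinucleotide) and all function names are hypothetical.

```python
from typing import Callable, List, Tuple

def apply_snv(seq: str, pos: int, alt: str) -> str:
    # Return a copy of seq with the base at 0-based pos replaced by alt.
    return seq[:pos] + alt + seq[pos + 1:]

def toy_donor_scores(seq: str) -> List[float]:
    # Stand-in for a real model: 1.0 wherever a "GT" dinucleotide starts.
    return [1.0 if seq[i:i + 2] == "GT" else 0.0 for i in range(len(seq))]

def largest_score_change(
    seq: str, pos: int, alt: str, score_fn: Callable[[str], List[float]]
) -> Tuple[int, float]:
    # Score reference and variant sequences, then return the position whose
    # score changes most, together with the signed change (variant - reference).
    ref_scores = score_fn(seq)
    alt_scores = score_fn(apply_snv(seq, pos, alt))
    deltas = [a - r for r, a in zip(ref_scores, alt_scores)]
    idx = max(range(len(deltas)), key=lambda i: abs(deltas[i]))
    return idx, deltas[idx]

reference = "CAGGTAAGTC"                     # made-up 10-base sequence
where, change = largest_score_change(reference, pos=3, alt="A", score_fn=toy_donor_scores)
print(where, change)                         # 3 -1.0
```

With a long-range model in place of the toy scorer, the same comparison could flag a variant whose largest effect falls on a splice site far from the variant itself.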
When tested against leading splice prediction systems across multiple large genomic datasets, SSNet achieved state-of-the-art performance for both splice site prediction and aberrant splicing detection. In simulations using the DMD gene and evaluations of pathogenic variants from ClinVar, a database of genetic variants, the model demonstrated its ability to capture the effects of distant regulatory sequences beyond the effective range of conventional approaches.
"The key achievement of this work is that we successfully modeled ultra-long-range genomic interactions while preserving high computational efficiency and single-nucleotide resolution," said Prof. Kenta Nakai of the Human Genome Center, Institute of Medical Science, University of Tokyo.
The study, published in Nucleic Acids Research on June 22, 2026, was conducted by Prof. Nakai and Ph.D. student Yuna Miyachi from the University of Tokyo. Their work demonstrates that hierarchical Transformer architectures could become valuable tools beyond splice site prediction, potentially supporting future research into promoter-enhancer interactions, three-dimensional genome organization, and broader DNA language models.
What Makes This Different from Previous AI Approaches to DNA?
Many existing artificial intelligence models for DNA analysis were adapted from natural language processing, treating DNA sequences similarly to how these systems process human language. However, DNA has fundamentally different properties that require specialized architectural design. Ms. Miyachi explained the team's approach to addressing this mismatch.
"By redesigning the architecture to account for long-range genomic interactions and strict sequence resolution, we aimed to create a system better suited to biological reality," noted Ms. Yuna Miyachi, Ph.D. student in the Department of Computer Science at the University of Tokyo.
The team also expects opportunities to collaborate with specialists in clinical and genomic medicine, where the technology could advance precision medicine by enabling accurate and interpretable analysis of genomic regions spanning up to 100,000 base pairs. By capturing long-range regulatory signals while maintaining single-nucleotide precision, SSNet provides a powerful new framework for studying RNA splicing and interpreting disease-associated variants.
This advancement comes as the broader epigenetics market, which includes technologies for analyzing gene regulation and DNA modifications, is experiencing significant growth. The global epigenetics market is projected to grow from $2.22 billion in 2026 to $3.92 billion by 2031, driven by increasing adoption of epigenomic profiling technologies and growing application of artificial intelligence in data interpretation. As AI-enabled tools like SSNet become more sophisticated, they are expected to accelerate research into gene regulation, disease mechanisms, and precision medicine applications.