A New Search Engine for Protein Structures Could Unlock Hidden Functions in Millions of AI-Predicted Proteins
A team at Seoul National University has created a search tool called Folddisco that can rapidly locate proteins with specific functions among hundreds of millions of structures predicted by artificial intelligence. It searches 20 times faster than conventional approaches while needing only one quarter of their storage. The work, published in Nature Biotechnology, addresses a critical bottleneck in protein research: as AI tools like AlphaFold2 generate vast databases of predicted protein structures, scientists have struggled to search them efficiently for proteins with desired properties.
Why Has Finding Specific Proteins in Massive Databases Been So Difficult?
When Google DeepMind's AlphaFold2 arrived, it transformed protein science by predicting the three-dimensional structures of hundreds of millions of proteins in a short timeframe. But this abundance created a new problem: how do you search through millions of structures to find the ones that matter for your research or drug development project? Traditional search methods indexed proteins by their amino acid sequences, which is like trying to find a book in a library by reading every word on every page.
The key insight behind Folddisco is that protein function isn't determined by the order of amino acids alone, but by the three-dimensional shape they form. A "structural motif" is a small three-dimensional pattern where several amino acids cluster at specific angles and positions, functioning like a fingerprint that reveals what a protein does and whether it's active or inactive. By focusing on these geometric patterns rather than sequence order, researchers can identify proteins with similar functions even if their amino acid sequences differ dramatically.
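To make the idea of a geometric fingerprint concrete, here is a minimal Python sketch. It is an illustration only, not Folddisco's actual representation: the function name `pairwise_signature`, the toy coordinates, and the choice of sorted pairwise distances as the fingerprint are all assumptions made for this example. It shows that two groups of residues at unrelated sequence positions, and at different places in space, can still share the same geometry.

```python
import math
from itertools import combinations


def pairwise_signature(coords, ndigits=1):
    """Return the sorted pairwise distances between residue positions.

    Distances do not change when a structure is moved or rotated, and they
    ignore where the residues sit in the sequence, so this acts as a crude
    geometric fingerprint (a hypothetical stand-in for a real motif encoding).
    """
    return sorted(
        round(math.dist(a, b), ndigits) for a, b in combinations(coords, 2)
    )


# Toy motifs: sequence position -> (x, y, z) in angstroms. Values are invented.
motif_a = {57: (0.0, 0.0, 0.0), 102: (3.0, 0.0, 0.0), 195: (0.0, 4.0, 0.0)}
# Same shape as motif_a, shifted in space and at different sequence positions.
motif_b = {12: (10.0, 5.0, -2.0), 48: (13.0, 5.0, -2.0), 230: (10.0, 9.0, -2.0)}
# Same sequence positions as motif_a, but a different shape.
motif_c = {57: (0.0, 0.0, 0.0), 102: (6.0, 0.0, 0.0), 195: (0.0, 4.0, 0.0)}

sig_a = pairwise_signature(motif_a.values())
sig_b = pairwise_signature(motif_b.values())
sig_c = pairwise_signature(motif_c.values())

print(sig_a == sig_b)  # True: same geometry despite different sequence positions
print(sig_a == sig_c)  # False: same sequence positions, different geometry
```

The comparison deliberately never looks at the sequence positions, which is the point of the paragraph above: shape, not order, drives the match.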
How Does Folddisco Actually Work?
- Geometric Indexing: Instead of storing position information about amino acids in a sequence, Folddisco converts geometric information such as distances, angles, and orientations between neighboring amino acids into numerical form, dramatically reducing storage needs.
- Shape-Based Distinction: The software incorporates information about the orientation of amino acid side chains that influence protein function, allowing it to distinguish even subtle shape differences between structural motifs that other methods might miss.
- Smart Scoring System: Folddisco uses a "sparsity-based scoring" approach that assigns lower scores to common patterns and higher scores to rare patterns, enhancing search accuracy by prioritizing unusual and potentially more interesting structural features.
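The indexing and scoring ideas in the list above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the published method: the feature used here is only the pair of residue types plus a binned distance (the article also mentions angles and side-chain orientation), the rarity weight `log(1 + N / hits)` is a generic inverse-frequency formula chosen for the example, and the names `pair_features`, `build_index`, and `search` as well as the toy structures are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def pair_features(residues, bin_width=1.0):
    """Turn a structure into a set of discretized residue-pair features.

    residues: list of (amino_acid_letter, (x, y, z)).
    Each feature is (type_1, type_2, distance_bin); no sequence position is kept.
    """
    feats = set()
    for (aa1, p1), (aa2, p2) in combinations(residues, 2):
        d_bin = int(math.dist(p1, p2) // bin_width)
        feats.add((*sorted((aa1, aa2)), d_bin))
    return feats


def build_index(structures):
    """Inverted index: feature -> set of structure ids containing it."""
    index = defaultdict(set)
    for sid, residues in structures.items():
        for feat in pair_features(residues):
            index[feat].add(sid)
    return index


def search(index, n_structures, query_residues):
    """Rank structures by shared features, weighting rare features more."""
    scores = defaultdict(float)
    for feat in pair_features(query_residues):
        hits = index.get(feat, ())
        if not hits:
            continue
        # Assumed inverse-frequency weight: common features contribute less.
        weight = math.log(1 + n_structures / len(hits))
        for sid in hits:
            scores[sid] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy database with invented coordinates (angstroms).
structures = {
    "P1": [("C", (0.0, 0.0, 0.0)), ("C", (3.5, 0.0, 0.0)), ("H", (0.0, 5.5, 0.0))],
    "P2": [("C", (1.0, 1.0, 1.0)), ("C", (4.5, 1.0, 1.0)), ("G", (1.0, 1.0, 9.5))],
    "P3": [("A", (0.0, 0.0, 0.0)), ("G", (2.5, 0.0, 0.0))],
}
index = build_index(structures)

# Query motif: the P1 geometry shifted by (2, 2, 2).
query = [("C", (2.0, 2.0, 2.0)), ("C", (5.5, 2.0, 2.0)), ("H", (2.0, 7.5, 2.0))]
results = search(index, len(structures), query)
print(results)  # P1 ranks first, then P2; P3 shares no feature and is absent
```

In this toy run the cysteine pair appears in two structures and so counts for less than the two cysteine-histidine pairs found only in P1. That is the intuition behind rewarding rare patterns, although the real scoring function may differ.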
The performance improvements are substantial. Folddisco uses only one quarter of the storage space required by conventional methods, while achieving 20-fold faster structural motif searches and 11-fold faster index generation. For researchers working with massive protein databases, this translates to finding relevant structures in seconds rather than hours.
What Real-World Problems Could Folddisco Solve?
The Seoul National University team validated their software in practical tests that demonstrate its potential impact. They detected the "zinc finger" structural motif in proteins derived from oysters and wastewater whose functions had previously been unknown. They also successfully distinguished between the active and inactive states of G protein-coupled receptors (GPCRs), which are located in cell membranes and are major targets for pharmaceutical development.
"At present, the tool can be used only for protein structure searches, but we plan to expand its scope to biomolecules such as nucleic acids and drugs that interact with proteins," said Professor Martin Steinegger, leader of the research team. "We aim to develop it into a tool for integrative analysis of complex biological phenomena."
The researchers expect Folddisco to serve multiple purposes across the life sciences. Beyond elucidating the functions of unknown proteins, the tool can help researchers verify the originality of AI-designed proteins, develop artificial enzymes, and design drug candidates. Essentially, it functions as a "protein search engine" that rapidly retrieves structures similar to a desired structural motif from massive databases, enabling researchers to ask questions like "show me all proteins with this specific binding pocket" or "find proteins with this catalytic mechanism".
This development arrives at a critical moment in structural biology. As AlphaFold2 and similar AI tools continue generating predictions for hundreds of millions of proteins, and potentially billions, the ability to efficiently search and analyze these structures becomes increasingly valuable. Folddisco bridges the gap between prediction capability and practical utility, making it possible for researchers to use AI-generated protein databases in ways that were previously too computationally expensive or time-consuming to consider.
The tool represents a shift in how scientists approach protein discovery. Rather than designing experiments to test individual proteins one at a time, researchers can now computationally screen millions of structures to identify candidates most likely to have desired properties, then validate those predictions experimentally. This could accelerate development timelines for new therapeutics, industrial enzymes, and biomaterials.