New Voting System Eliminates AI Hallucinations in Medical Chatbot Tests
A new verification method developed by Binghamton University researchers could eliminate the false information that AI chatbots confidently generate when answering medical questions. The breakthrough uses multiple AI models that "vote" on the correct answer, achieving zero hallucinations in controlled testing of medical terminology.
Why Are AI Chatbots Spreading Medical Misinformation?
As people increasingly turn to AI chatbots like ChatGPT for health advice, the stakes of accuracy have never been higher. Last year, Binghamton University researchers tested OpenAI's ChatGPT and found it performed well at identifying disease terms, drug names, and genetic information. However, the same chatbot also generated a troubling number of "hallucinations," confidently delivering made-up information as if it were fact.
This problem is particularly dangerous in healthcare, where incorrect medical information could lead patients to misdiagnose symptoms or delay seeking proper care. The root cause lies in how large language models (LLMs), the AI systems powering chatbots, work: they generate responses word by word based on patterns in their training data, and sometimes produce plausible-sounding but entirely fabricated information.
How Does the New Verification System Work?
Ahmed Abdeen Hamed, a research fellow at Binghamton University, and George J. Klir Professor of Systems Science Luis M. Rocha developed an innovative solution funded by a $100,000 grant from New York state's Empire AI Consortium. Their protocol harnesses the power of multiple open-source AI models working together.
The system gives the same plain-language description of medical symptoms to seven different large language models. Each model then identifies what it believes are the correct medical terms, complete with official identification numbers. Crucially, the models use retrieval-augmented generation (RAG), a technique that requires them to reference an authoritative database of medical terminology before providing an answer, rather than relying solely on their training data.
Across more than 10,000 experiments, the results were striking: 76.85% of answers were supported by at least four of the seven chatbots, and the remaining 23.15% were supported by at least two. Most importantly, no unmatched terms emerged, and no hallucinations appeared in the verified answers.
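The tally behind figures like these can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' code: the model names, the `tally_support` helper, and the `ID:0001`-style identifiers are hypothetical placeholders.

```python
from collections import Counter

def tally_support(answers_by_model):
    """Count how many distinct models proposed each (term, identifier) pair."""
    support = Counter()
    for answers in answers_by_model.values():
        # A set ensures one model cannot vote twice for the same pair.
        for pair in set(answers):
            support[pair] += 1
    return support

# Hypothetical outputs from seven models for one symptom description.
answers_by_model = {
    "model_a": [("headache", "ID:0001")],
    "model_b": [("headache", "ID:0001")],
    "model_c": [("headache", "ID:0001"), ("nausea", "ID:0002")],
    "model_d": [("headache", "ID:0001")],
    "model_e": [("nausea", "ID:0002")],
    "model_f": [("headache", "ID:0001")],
    "model_g": [],
}

support = tally_support(answers_by_model)
# Split answers by support level: four or more votes, or two to three.
majority = {pair for pair, votes in support.items() if votes >= 4}
minority = {pair for pair, votes in support.items() if 2 <= votes < 4}
```

In this toy panel, "headache" collects five votes and lands in the majority group, while "nausea" collects two and lands in the lower-support group.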
"The new workflow is incredible because it can verify anything from a biomedical point of view, biological knowledge with disease and genetics, translational knowledge from diseases to treatments and clinical trials, and also from a healthcare point of view with symptoms and treatments," said Ahmed Abdeen Hamed.
Steps to Implement AI Verification in Healthcare Settings
- Deploy Multiple Models: Use seven or more open-source large language models simultaneously rather than relying on a single AI system, creating redundancy and cross-verification of medical information.
- Enforce Database Referencing: Require all models to use retrieval-augmented generation, forcing them to check authoritative medical databases before generating responses instead of relying purely on learned patterns.
- Establish Voting Thresholds: Set a confidence threshold so that an answer must be supported by a majority of models (at least four out of seven) before it is presented to patients or healthcare providers.
- Conduct Repeated Testing: Run the verification protocol multiple times with randomly selected models from a larger pool to continuously increase confidence in the accuracy of results.
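The database check and the voting threshold from the steps above could be combined roughly as follows. This is a simplified sketch under stated assumptions: `REFERENCE_TERMS` is a hypothetical stand-in for an authoritative terminology database, the identifiers are made up, and checking answers after the fact is a simplification of retrieval-augmented generation, in which models consult the database before answering.

```python
# Hypothetical stand-in for an authoritative terminology database.
REFERENCE_TERMS = {
    "ID:0001": "headache",
    "ID:0002": "nausea",
}

def is_grounded(term, identifier):
    """Accept a pair only if the identifier exists and maps to that term."""
    expected = REFERENCE_TERMS.get(identifier)
    return expected is not None and expected.lower() == term.lower()

def verified_terms(answers_by_model, threshold=4):
    """Return grounded (term, id) pairs backed by at least `threshold` models."""
    votes = {}
    for answers in answers_by_model.values():
        # Normalize case first so each model votes at most once per pair.
        normalized = {(term.lower(), identifier) for term, identifier in answers}
        for term, identifier in normalized:
            if is_grounded(term, identifier):
                key = (term, identifier)
                votes[key] = votes.get(key, 0) + 1
    return sorted(pair for pair, count in votes.items() if count >= threshold)

# Hypothetical panel of seven models; "ID:9999" is not in the database.
panel = {
    "model_a": [("Headache", "ID:0001")],
    "model_b": [("headache", "ID:0001")],
    "model_c": [("headache", "ID:0001"), ("migraine", "ID:9999")],
    "model_d": [("headache", "ID:0001"), ("nausea", "ID:0002")],
    "model_e": [("nausea", "ID:0002")],
    "model_f": [],
    "model_g": [("migraine", "ID:9999")],
}

result = verified_terms(panel)
```

Here the "migraine" answers are discarded because their identifier is missing from the reference table, and "nausea" is held back because only two models support it, so only "headache" clears the four-vote threshold.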
One major advantage of this approach is its scalability. "There can be 100 large language models that are open source, and every time we can perform an experiment with seven LLMs selected at random from that list," Hamed explained. "When we perform the experiment many, many times, we increase the confidence in the voting."
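The repeated random panels Hamed describes might look something like the sketch below. Everything here is an assumption made for illustration: the pool of model names, the `ask` callable (which in practice would query a real model), the toy answers, and the fixed seed are not taken from the published protocol.

```python
import random

def repeated_panel_vote(pool, ask, question, panel_size=7,
                        rounds=100, threshold=4, seed=0):
    """Return the fraction of rounds in which each answer reaches the threshold."""
    rng = random.Random(seed)  # Seeded so the sketch is reproducible.
    wins = {}
    for _ in range(rounds):
        panel = rng.sample(pool, panel_size)  # Draw models without replacement.
        counts = {}
        for model in panel:
            answer = ask(model, question)
            if answer is not None:
                counts[answer] = counts.get(answer, 0) + 1
        for answer, count in counts.items():
            if count >= threshold:
                wins[answer] = wins.get(answer, 0) + 1
    return {answer: n / rounds for answer, n in wins.items()}

# Hypothetical pool of 100 models.
pool = [f"model_{i}" for i in range(100)]

def toy_ask(model, question):
    # Toy behaviour: the last ten models each invent a unique wrong answer,
    # while the other ninety agree on the same placeholder identifier.
    index = int(model.split("_")[1])
    return f"wrong_{index}" if index >= 90 else "ID:0001"

confidence = repeated_panel_vote(pool, toy_ask, "persistent headache")
```

Because each fabricated answer comes from a single model, it can never collect four votes on a seven-model panel, so only the shared answer can appear in `confidence`. That is the intuition behind the voting: independent errors rarely agree with each other.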
What Makes This Different From Current AI Healthcare Applications?
While AI is already transforming healthcare in multiple ways, most current applications focus on specific tasks like radiology image analysis or clinical note-taking. UCLA Health, for example, has deployed AI-powered chatbots that provide patients with rapid information about specialists and scheduling, along with "scribe" technology that converts doctor-patient conversations into clinical notes.
However, the Binghamton team's verification protocol addresses a fundamental problem that affects all AI systems: the tendency to generate plausible-sounding but false information. This is especially critical for patient-facing applications where medical accuracy directly impacts health decisions.
"This protocol is a big step toward the democratization of knowledge verification," said Ahmed Abdeen Hamed.
The protocol's applications extend beyond medical terminology. Luis M. Rocha, who collaborated on the research, noted that it can extract and verify evidence for adverse drug reactions from clinical trials, scientific literature, pharmacological databases, and even social media discourse. The team has already begun piloting multi-layer models of ER+ breast cancer, demonstrating the system's potential for precision medicine.
Could This Approach Work Beyond Healthcare?
While the study focused on biomedical applications, the researchers believe their discovery could eliminate other kinds of LLM hallucinations as well. The same voting protocol could theoretically curb fabricated legal citations, fake academic references, or historical inaccuracies in any field where accuracy is critical.
As healthcare systems nationwide grapple with how to safely integrate AI into patient care, this verification method offers a practical pathway forward. By combining multiple AI models with mandatory database referencing and consensus voting, healthcare providers could offer patients AI-powered information they can actually trust.
The research was recently published in STAR Protocols, a peer-reviewed journal focused on reproducible research methods. Hamed has since transitioned to a new role as a research associate professor at the University of Nebraska-Lincoln, where he plans to continue advancing responsible AI development.