OpenAI's o1 Model Outperforms Doctors on Clinical Reasoning, But Here's Why Hospitals Won't Replace Physicians Yet
In a new study, OpenAI's o1 series large language model (LLM) outperformed physicians on several clinical reasoning tasks, including rapid assessment, medication recommendations, and overall case management. However, researchers and medical leaders emphasize that the findings are a proof of concept for AI as a diagnostic aid, not evidence that artificial intelligence is ready to practice medicine independently or replace human doctors.
What Did the Harvard-Stanford Study Actually Test?
Researchers at Harvard Medical School and Stanford University compared OpenAI's o1 model against physicians with varying levels of experience on clinical reasoning tasks. The study evaluated 76 real emergency room cases from a Boston hospital, examining how both the AI model and human doctors performed at three critical decision points: initial triage upon arrival, first contact with a physician, and admission to the medical floor or intensive care unit.
The cases included unstructured clinical data pulled directly from electronic health records, mirroring the high-stakes, time-sensitive decisions that emergency medicine professionals make with incomplete information. Two independent physicians evaluated the assessments without knowing whether they came from the AI system or from attending physicians. The o1 model matched or exceeded human performance at each stage, with the widest gap appearing at initial triage, where clinicians have the least information available.
Why Can't AI Replace Doctors Based on These Results?
The study's findings have sparked significant media attention, but medical experts caution against overstating what the results actually mean for clinical practice. The key limitation: the study only evaluated text-based diagnostic reasoning, not the full complexity of real-world medicine.
"The study is a validation of the diagnostic performance of these models. The basic claim is that the diagnostic performance of the models is not just an artifact of the evaluation mechanisms like vignettes, but holds with real clinical data. That does not mean that just deploying it makes a difference in patient care. It's more, like, watching over like a second set of eyes," said Adam Rodman, a general internist and medical educator at Beth Israel Deaconess Medical Center and assistant professor at Harvard Medical School.
Adam Rodman, MD, MPH, FACP, General Internist and Medical Educator at Beth Israel Deaconess Medical Center
Clinical practice involves sensory information that AI systems currently cannot process effectively. Physicians rely on examining patients physically, observing their behavior, listening for subtle changes in voice, and integrating information from multiple sources in real time. Current foundation models struggle to reason over non-text inputs such as imaging, audio, or visual cues.
"There's a lot of desire to say that we could use technology to replace doctors, but that's not at all what I think it's capable of. The models are very capable in assisting with diagnosis, but they are not good at integrating information from many different sources. Doing a physical exam, I talk to a patient, look at them, hear the hesitation in their voices. I'm situated in the room with them. Not just getting information from them directly, looking at the medical record, calling other people, and doing an investigation. The LLMs are really good at integrating information that a physician curates from other sources, or they're really good at collecting that information verbally from a patient. They are not at all good at all those other parts of the diagnostic process, which are just as important," Rodman explained.
What Are the Real Limitations of This Study?
Although the study drew on 76 real cases, several important constraints limit how broadly the findings apply to medical practice. The study focused exclusively on internal medicine and emergency care, which represent only a fraction of clinical specialties. Different medical fields require different skill sets, and the model's performance may vary significantly depending on diagnosis type, patient characteristics, or practice location.
Additionally, the study evaluated a specific task: providing second opinions at predefined clinical touchpoints. Real emergency medicine decisions center on triage, patient disposition, and immediate management rather than diagnostic accuracy alone. The cases were also carefully curated and cleaned up by clinicians, which can overstate AI performance compared to the messy, unstructured data encountered in actual clinical workflows.
How Should Hospitals Actually Use AI Reasoning Models?
Rather than replacing physicians, medical leaders see AI reasoning models like o1 as tools to support clinical decision-making and reduce diagnostic errors. The appropriate path forward involves carefully controlled clinical trials in multiple real care settings, similar to how the medical field evaluates any new intervention.
"AI medical tools can help make diagnosis and be an adjunct to help provide faster care but cannot replace clinical gestalt which comes from years of clinical practice and more from treating patients by observing subtle signs. AI is very useful in tedious tasks and may help decrease workload for physicians by generating notes and being able to sift through patients' health summary to make relevant data available for clinicians for making diagnosis and treatment plans. There is always the human touch in healing, which AI would not be able to provide," noted Jasmeen Pombra, assistant chief of Hospital Operations and chief of Neuro Hospital Medicine at Kaiser Permanente.
Jasmeen Pombra, Assistant Chief of Hospital Operations and Chief of Neuro Hospital Medicine at Kaiser Permanente
Future research must examine how reasoning models might be safely integrated into clinical workflows, particularly by developing new benchmarks and trials that assess AI performance on multimodal data including imaging, audio, and real-time patient interaction. The goal is not to automate diagnosis, but to create a collaborative system where AI handles information synthesis and pattern recognition while physicians provide clinical judgment, patient care, and the human elements of healing.