OpenAI's o1 Model Outperforms Doctors on Clinical Reasoning, But Here's Why Hospitals Won't Replace Physicians Yet

FrontierNews.ai AI Research Desk

OpenAI's o1 series large language model (LLM) outperformed physicians across several clinical reasoning tasks in a new study, including rapid assessment, medication recommendations, and overall case management. However, researchers and medical leaders emphasize that the findings represent a proof of concept for AI as a diagnostic aid, not evidence that artificial intelligence is ready to practice medicine independently or replace human doctors.

What Did the Harvard-Stanford Study Actually Test?

Researchers at Harvard Medical School and Stanford University compared OpenAI's o1 model against physicians with varying levels of experience on clinical reasoning tasks. The study evaluated 76 real emergency room cases from a Boston hospital, examining how both the AI model and human doctors performed at three critical decision points: initial triage upon arrival, first contact with a physician, and admission to the medical floor or intensive care unit.

The cases included unstructured clinical data pulled directly from electronic health records, mirroring the high-stakes, time-sensitive decisions that emergency medicine professionals make with incomplete information. Two independent physicians evaluated the assessments without knowing whether they came from the AI system or from attending physicians. The o1 model matched or exceeded human performance at each stage, with the widest performance gap appearing at initial triage, where clinicians have the least information available.

Why Can't AI Replace Doctors Based on These Results?

The study's findings have drawn significant media attention, but medical experts caution against overstating what the results mean for clinical practice. The key limitation: the study evaluated only text-based diagnostic reasoning, not the full complexity of real-world medicine.

"The study is a validation of the diagnostic performance of these models. The basic claim is that the diagnostic performance of the models is not just an artifact of the evaluation mechanisms like vignettes, but holds with real clinical data. That does not mean that just deploying it makes a difference in patient care. It's more, like, watching over like a second set of eyes," said Adam Rodman, a general internist and medical educator at Beth Israel Deaconess Medical Center and assistant professor at Harvard Medical School.
Adam Rodman, MD, MPH, FACP, General Internist and Medical Educator at Beth Israel Deaconess Medical Center

Clinical practice involves sensory information that AI systems currently cannot process effectively. Physicians rely on physical examinations, observing patient behavior, listening to subtle changes in voice, and integrating information from multiple sources in real time. Current foundation models struggle to reason over non-text inputs such as imaging, audio, or visual cues.

"There's a lot of desire to say that we could use technology to replace doctors, but that's not at all what I think it's capable of. The models are very capable in assisting with diagnosis, but they are not good at integrating information from many different sources. Doing a physical exam, I talk to a patient, look at them, hear the hesitation in their voices. I'm situated in the room with them. Not just getting information from them directly, looking at the medical record, calling other people, and doing an investigation. The LLMs are really good at integrating information that a physician curates from other sources, or they're really good at collecting that information verbally from a patient. They are not at all good at all those other parts of the diagnostic process, which are just as important," Rodman explained.

What Are the Real Limitations of This Study?

Although the study covered 76 cases, several important constraints limit how broadly the findings apply to medical practice. The study focused exclusively on internal medicine and emergency care, which represent only a fraction of clinical specialties. Different medical fields require different skill sets, and the model's performance may vary significantly depending on diagnosis type, patient characteristics, or practice location.

Additionally, the study evaluated a specific task: providing second opinions at predefined clinical touchpoints. Real emergency medicine decisions center on triage, patient disposition, and immediate management rather than diagnostic accuracy alone. The cases were also carefully curated and cleaned up by clinicians, which can overstate AI performance compared to the messy, unstructured data encountered in actual clinical workflows.

Study Scope: Research focused only on internal medicine and emergency care, not representative of broader medical practice across multiple specialties.

Data Limitations: Cases were carefully curated and cleaned by clinicians, potentially overstating AI performance compared to real-world messy data in clinical workflows.

Missing Sensory Information: The study evaluated only text-based reasoning; clinical practice requires physical exams, visual cues, and auditory signals that current AI models cannot process effectively.

Integration Challenges: AI models struggle to integrate information from multiple sources the way physicians do, including patient observation, medical records, and consultation with other providers.

How Should Hospitals Actually Use AI Reasoning Models?

Rather than replacing physicians, medical leaders see AI reasoning models like o1 as tools to support clinical decision-making and reduce diagnostic errors. The appropriate path forward involves carefully controlled clinical trials in multiple real care settings, similar to how the medical field evaluates any new intervention.

"AI medical tools can help make diagnosis and be an adjunct to help provide faster care but cannot replace clinical gestalt which comes from years of clinical practice and more from treating patients by observing subtle signs. AI is very useful in tedious tasks and may help decrease workload for physicians by generating notes and being able to sift through patients' health summary to make relevant data available for clinicians for making diagnosis and treatment plans. There is always the human touch in healing, which AI would not be able to provide," noted Jasmeen Pombra, assistant chief of Hospital Operations and chief of Neuro Hospital Medicine at Kaiser Permanente.

Future research must examine how reasoning models might be safely integrated into clinical workflows, particularly by developing new benchmarks and trials that assess AI performance on multimodal data, including imaging, audio, and real-time patient interaction. The goal is not to automate diagnosis but to create a collaborative system in which AI handles information synthesis and pattern recognition while physicians provide clinical judgment, patient care, and the human elements of healing.
