ChatGPT-4o Just Crushed Doctors on a Medical Knowledge Test. Here's What That Actually Means

Large language models (LLMs), which are AI systems trained on vast amounts of text to understand and generate human language, are now outperforming medical professionals on standardized clinical knowledge assessments. In a head-to-head comparison at a major medical conference, ChatGPT-4o and 12 other AI models significantly outscored 123 physicians and medical students on an acute kidney injury (AKI) knowledge test, raising important questions about how these tools should be integrated into clinical practice.

The study, conducted at the 131st Annual Congress of the German Society of Internal Medicine in Wiesbaden, Germany, in May 2025, tested 13 publicly available LLMs against a heterogeneous group of medical professionals. Both groups completed an identical assessment consisting of two clinical case vignettes and 15 multiple-choice questions focused on AKI, a common and serious complication affecting 10 to 15 percent of hospitalized patients and up to 50 percent of those in intensive care.

How Did the AI Models Actually Perform?

The results were striking. The 13 LLMs achieved a mean score of 13.5 out of 15 points, equivalent to 90 percent accuracy, with several models reaching a perfect score. In contrast, the 123 human participants averaged 7.3 out of 15 points, or 48.7 percent accuracy. Only 16.3 percent of the physicians and medical students scored 11 points or higher.

The speed advantage was equally dramatic. ChatGPT-4o completed the entire test in approximately 30 seconds, while human participants required a mean of 7.3 minutes. This efficiency gap underscores a fundamental difference in how AI systems process and retrieve medical knowledge compared to human cognition.
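The headline numbers follow directly from the reported figures. A quick sanity check, using only values stated in the study:

```python
# Sanity check of the reported figures; all inputs are taken from the study.
llm_mean, human_mean, max_points = 13.5, 7.3, 15

print(f"LLM accuracy:   {llm_mean / max_points:.1%}")    # 90.0%
print(f"Human accuracy: {human_mean / max_points:.1%}")  # 48.7%

llm_minutes, human_minutes = 0.5, 7.3
print(f"Speed factor:   {human_minutes / llm_minutes:.1f}x")  # 14.6x
```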

Which AI Models Were Tested, and How Do They Compare?

The study evaluated a diverse range of AI systems representing different architectures, training approaches, and deployment models. The tested LLMs included:

  • OpenAI Models: ChatGPT-4o, ChatGPT-4o-mini, ChatGPT-4.5, ChatGPT-4, ChatGPT o3-mini-high, and ChatGPT o3-mini (reasoning)
  • Anthropic Models: Claude 3.7 Sonnet, representing instruction-tuned systems designed for nuanced reasoning
  • Google Models: Gemini 2.0 Flash and Gemini 2.5 Pro Experimental, showcasing multimodal capabilities
  • Open-Source and Commercial Models: Mistral Small 3.1, DeepSeek V3-0324, DeepSeek R1, and Grok-3

All models were tested using their default settings via official user interfaces or application programming interfaces (APIs), which are standardized ways to access the models programmatically. Each model received the same prompt: "Please answer the questions and provide an overview of your answer options." Researchers submitted all case vignettes and multiple-choice questions in a single prompt and evaluated each model once.
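To make that protocol concrete, here is a minimal sketch of what a single-prompt evaluation might look like using the OpenAI Python SDK. The instruction string is the one quoted in the study; the model identifier, placeholder test material, and surrounding scaffolding are illustrative assumptions, not the researchers' actual test harness:

```python
# Illustrative single-prompt evaluation, roughly mirroring the study's protocol.
# The test material below is a placeholder; the study's vignettes and questions
# are not reproduced here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

instruction = "Please answer the questions and provide an overview of your answer options."
test_material = "<two AKI case vignettes + 15 multiple-choice questions>"

# Default settings: no temperature tuning, no system prompt, one run per model.
response = client.chat.completions.create(
    model="gpt-4o",  # assumed identifier for ChatGPT-4o
    messages=[{"role": "user", "content": f"{instruction}\n\n{test_material}"}],
)

print(response.choices[0].message.content)
```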

The inclusion of proprietary systems from major commercial providers like OpenAI and Google alongside open-source research models ensured the study captured a representative cross-section of currently available AI technology. This diversity strengthens the generalizability of the findings across the broader AI landscape.

Why Does This Matter for Patient Care?

Acute kidney injury is a critical clinical problem. It occurs in approximately 10 to 15 percent of hospital admissions and up to 50 percent of intensive care patients. Timely recognition and swift, accurate intervention are essential to prevent irreversible damage and progression to chronic kidney disease (CKD), which affects 11 to 15 percent of the global population and is associated with increased mortality, reduced quality of life, and substantial economic burden.

The ability of LLMs to rapidly access and apply medical knowledge could transform decision support at the point of care. Rather than spending time consulting textbooks or searching medical databases, physicians could use AI systems as rapid, cost-effective tools for clinical knowledge support. This is particularly valuable in emergency and inpatient settings where time-sensitive decisions can have life-or-death consequences.

However, the researchers emphasized a critical caveat: while LLMs demonstrated superior performance on this knowledge assessment, their role in real-world patient care remains undetermined. The study was designed to evaluate factual knowledge recall and application in a controlled, standardized format. Clinical practice involves far more than answering multiple-choice questions. It requires contextual judgment, understanding of individual patient circumstances, ethical reasoning, and the ability to integrate information from physical examination, imaging, and laboratory results.

What Are the Practical Implications for Clinicians?

The findings suggest several concrete ways that LLMs could enhance clinical workflows without replacing human judgment. These include:

  • Rapid Knowledge Retrieval: Clinicians can use AI systems to quickly access evidence-based information about AKI diagnosis, staging, and management, reducing the time spent searching medical literature or guidelines
  • Decision Support: LLMs can help clinicians organize and synthesize complex clinical information, presenting relevant diagnostic criteria and treatment options in a structured format
  • Educational Value: Medical students and residents can use AI systems to test their knowledge, identify knowledge gaps, and receive immediate feedback on their clinical reasoning
  • Documentation Assistance: AI models have demonstrated value in clinical documentation and automated report generation, potentially reducing administrative burden on physicians

The study also explored whether different prompting strategies could influence AI performance. Researchers conducted exploratory follow-up testing with three prompt variations designed to reflect neutral, role-based, and guideline-oriented instructions. This approach acknowledges that how clinicians phrase their questions to AI systems may affect the quality and relevance of the responses they receive.
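The study's exact wordings are not reproduced here, but the three styles might look something like the following sketch. All three phrasings are hypothetical illustrations; the guideline-oriented variant assumes a reference to KDIGO, which publishes the most widely used AKI guideline:

```python
# Hypothetical examples of the three prompting styles described above.
# The exact variant wordings used in the study are not quoted here.
PROMPT_VARIANTS = {
    "neutral": (
        "Please answer the questions and provide an overview of your answer options."
    ),
    "role_based": (
        "You are an experienced nephrologist. Please answer the questions and "
        "provide an overview of your answer options."
    ),
    "guideline_oriented": (
        "Answer the questions in line with the current KDIGO guideline for acute "
        "kidney injury, and provide an overview of your answer options."
    ),
}

def build_prompt(style: str, test_material: str) -> str:
    """Prepend the chosen instruction style to the unchanged test material."""
    return f"{PROMPT_VARIANTS[style]}\n\n{test_material}"
```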

What Are the Limitations and Open Questions?

The researchers were transparent about the study's scope and limitations. This was a single cross-sectional observational investigation with exploratory intent, conducted at a specific medical conference. It did not include longitudinal follow-up or external validation of the findings. The human participants were a heterogeneous group of internists and medical students attending a conference, which may not represent the full spectrum of clinical experience and expertise.

Additionally, the study evaluated performance on a standardized knowledge assessment, not on actual clinical decision-making with real patients. The gap between knowing the correct answer on a test and applying that knowledge in a complex, time-pressured clinical environment is substantial. Factors like patient communication, ethical considerations, resource constraints, and the integration of patient preferences into treatment decisions cannot be captured in a multiple-choice format.

The research highlights both the remarkable capabilities of modern LLMs and the essential role that human clinical judgment must continue to play. As these AI systems become more sophisticated and more widely available, the challenge for the medical profession will be to harness their strengths while maintaining the contextual, patient-centered approach that defines good clinical care. The future of medicine likely involves AI and humans working together, with AI handling rapid knowledge retrieval and synthesis, and physicians providing the judgment, empathy, and accountability that patients deserve.