Harvard Study Reveals ChatGPT's Surprising Weakness: Why It Fails at Real Science
ChatGPT significantly underperformed graduate students in a Harvard study, scoring roughly two letter grades lower on molecular biology exams. Researchers expected OpenAI's chatbot to struggle with critical thinking, but the results revealed a more fundamental problem: the AI model couldn't reliably remember, apply, or interpret scientific information, even when researchers tried to improve its performance with better prompts.
How Did Researchers Test ChatGPT Against Real Students?
Harvard researchers conducted the study using assignments from the university's Principles of Molecular Biology course, a 200-level class that spans an entire semester. To ensure students hadn't used artificial intelligence themselves, the researchers selected out-of-class assignments from 2022, before AI tools became widely available. They tested ChatGPT using GPT-4o, the model OpenAI released in May 2024.
The doctoral students in the study were expected to maintain a minimum grade of 80 percent, which represents a passing grade at the graduate level. This provided a clear benchmark for comparing human and AI performance on the same material.
Where Did ChatGPT Struggle Most?
The performance gap widened dramatically depending on the type of question. Here's how ChatGPT performed across different cognitive levels:
- Memory-Based Questions: ChatGPT scored 82 percent compared to students' 98 percent, showing the AI could recall basic information but not as reliably as humans.
- Understanding and Application: ChatGPT averaged 66 percent on questions requiring understanding, application, and analysis, while doctoral students scored 87 percent, a significant 21-point gap.
- Graph and Data Interpretation: ChatGPT showed a marked inability to interpret scientific graphs and raw data, in both short-answer and multiple-choice formats, even when researchers used a version specifically designed for image interpretation.
- Experimental Design: ChatGPT performed particularly poorly on questions asking students to identify, rationalize, and describe experimental controls they had learned through coursework.
The researchers noted that ChatGPT would have "failed" the course based on these results. The shortfall was driven largely by the model's markedly weak showing at the "apply" level, which in this study involved identifying and describing experimental controls.
What Does This Mean for AI in Education and Research?
The Harvard findings challenge the widespread perception that large language models, or LLMs (AI systems trained on vast amounts of text data), have reached near-expert levels of competence. While some tech companies and marketers have suggested that ChatGPT operates at a doctoral level, this study provides concrete evidence that the reality is more nuanced.
"We found a striking deficit in ChatGPT's ability to interpret scientific graphs and raw data in both short-answer and multiple-choice questions, even when using a version specifically designed for image interpretation," the researchers wrote.
Harvard Researchers, Principles of Molecular Biology Study
The implications extend beyond a single exam. If ChatGPT struggles to apply learned concepts and interpret scientific data, it raises questions about its reliability for tasks that require more than pattern recognition or information retrieval. This is particularly important in fields like medicine, engineering, and scientific research, where applying knowledge to novel problems is essential.
Is This Study Outdated Already?
Some observers on Reddit's r/science forum pointed out that the ChatGPT model used in the experiment may not represent the current state of AI technology. One commenter noted that LLMs have improved dramatically even in the past year, suggesting that the gap between ChatGPT and human performance may have narrowed since the study was conducted.
"I think it's critical to point out that when this study was done, LLMs like ChatGPT were nowhere near where they are now. As someone who uses LLMs daily and runs a significant research group, we have found that the difference between now and even just one year ago is an order of magnitude," one researcher commented.
Reddit Commenter, r/science Forum
However, other commenters expressed skepticism about claims that LLMs have reached expert-level performance. One critic observed that ChatGPT and similar models consistently fail to solve introductory physics and chemistry questions, suggesting that research-level biology would be even more challenging.
What This Reveals About AI Limitations
The Harvard study highlights a fundamental distinction between different types of cognitive tasks. Large language models excel at tasks that involve retrieving or summarizing information they've seen during training. They struggle when they need to apply that knowledge to new situations, interpret visual data, or reason through complex experimental designs.
Anyone who has spent significant time using LLMs knows they require careful oversight. As one Reddit commenter noted, even for focused tasks like coding, users need to watch out for hallucinations, instances where the AI confidently generates false information, as well as other unreliable output.
The takeaway isn't that ChatGPT is useless for scientific work. Rather, it's that the technology has real limitations that users need to understand. Students, researchers, and professionals using these tools should view them as assistants that can help with certain tasks, like drafting initial outlines or explaining concepts, but not as replacements for human expertise, especially when it comes to applying knowledge to novel problems or interpreting complex data.