How AI Is Learning to Describe Medical Images Like a Doctor Would
A new approach to medical image analysis is teaching artificial intelligence to describe what it sees in blood cell images using natural language, rather than just assigning them to categories. Instead of labeling a leukemia cell image as simply "AML" or "ALL," researchers are pairing images with short descriptive sentences that capture clinical meaning. This shift from simple classification to language-based description could make AI systems more interpretable and useful in real medical settings.
Why Are Doctors' AI Tools Still Making Basic Mistakes?
While computer vision has made remarkable strides in recent years, a troubling gap persists between what AI can do and what it can do reliably. Recent audits in Ontario found that AI note-taking systems used by physicians routinely fail at basic factual accuracy, even as the models demonstrate impressive capabilities in other areas. This capability-reliability gap reveals a fundamental challenge: AI systems can perform complex visual tasks while simultaneously stumbling on simple facts that any trained medical professional would catch instantly.
The problem extends beyond note-taking. Large language models (LLMs), which are AI systems trained on vast amounts of text data, can excel at creative tasks like recipe generation but fail at elementary conversions between teaspoons and tablespoons. This pattern suggests that current AI technology achieves surface-level competence without developing genuine understanding of the domains it operates in.
How Can Descriptive Language Improve Medical AI?
Researchers at Azad University of Tehran tackled this problem by rethinking how medical images are labeled. Rather than assigning each leukemia cell image to a single category, they created short descriptive sentences that capture clinical characteristics:
- AML (Acute Myeloid Leukemia): "Large immature blood cells with abnormal structure."
- CML (Chronic Myeloid Leukemia): "High number of abnormal white blood cells in the blood."
- ALL (Acute Lymphoblastic Leukemia): "Fast-growing immature lymphocyte cells in the blood."
- CLL (Chronic Lymphocytic Leukemia): "Small abnormal lymphocyte cells with slow progression."
This approach forces the AI model to learn the relationship between visual features and clinical language simultaneously. The system trains in both directions, image-to-text and text-to-image, so it learns to match images to their descriptions and descriptions back to their images.
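The article does not include the researchers' code, but bidirectional image-text training of this kind is commonly implemented as a symmetric contrastive objective, as popularized by CLIP. The sketch below is a minimal, illustrative version of that idea, not the paper's actual method: the function names, embedding sizes, and temperature value are all assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale each embedding vector to unit length."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style loss: image i should match text i, in both directions."""
    img_emb = l2_normalize(img_emb)
    txt_emb = l2_normalize(txt_emb)
    logits = img_emb @ txt_emb.T / temperature      # (N, N) similarity matrix
    labels = np.arange(len(logits))                 # diagonal pairs are matches

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)     # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()    # mean diagonal log-prob

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))                       # 4 mock image embeddings
loss_mismatched = symmetric_contrastive_loss(img, rng.normal(size=(4, 8)))
loss_matched = symmetric_contrastive_loss(img, img) # perfectly aligned pairs
print(loss_matched < loss_mismatched)
```

When image and text embeddings for the same cell line up, the loss drops; training therefore pulls each image toward its own description and away from the other three, which is what "connecting images to descriptions and descriptions back to images" amounts to in practice.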
"Instead of assigning only one class label, we connected each image to a short descriptive sentence. The interesting part of this project was trying to make AI understand medical images in a more human-like way using language," explained Mohammad Momenian, Research Assistant at Azad University of Tehran.
The model achieved approximately 70% accuracy on this task. While this may not sound exceptional compared to traditional classification systems that reach 99% accuracy on simpler tasks, the researchers argue their approach is more practical and informative. A simple numerical label like "0," "1," "2," or "3" tells a clinician nothing about why the AI made its decision. A descriptive sentence, by contrast, provides reasoning that a doctor can evaluate.
What Makes This Approach Different from Traditional Medical AI?
Traditional leukemia classification systems assign each cell image to a single class. This single-label approach is efficient but loses information. A descriptive sentence captures nuance that a single label cannot convey. The challenge, however, lies in creating text descriptions that are both simple enough for the model to learn from and detailed enough to be clinically useful.
The research team also had to address a common problem in medical AI: imbalanced datasets. Some leukemia types are rarer than others, meaning the AI has far fewer training examples for certain cell types. The researchers used generative adversarial networks (GANs), a type of AI that can create synthetic training data, to balance the dataset and improve the model's ability to recognize rare cell types.
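A GAN itself is too large for a short example, but the balancing arithmetic it serves is simple to sketch. The snippet below uses invented class counts and a placeholder in place of a trained generator; it only shows the bookkeeping: how many synthetic samples each class would need so that rare types match the most common one.

```python
from collections import Counter
import random

random.seed(0)

# Hypothetical imbalanced dataset: chronic types have far fewer real images.
labels = ["AML"] * 500 + ["ALL"] * 450 + ["CML"] * 120 + ["CLL"] * 80

def synthetic_quota(labels):
    """Synthetic samples needed per class to match the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {cls: target - n for cls, n in counts.items()}

def fake_gan_sample(cls):
    """Stand-in for a trained generator (not a real model): returns a
    placeholder 'image' tagged with the class it was generated for."""
    return {"class": cls, "pixels": [random.random() for _ in range(4)]}

quota = synthetic_quota(labels)
synthetic = [fake_gan_sample(cls) for cls, n in quota.items() for _ in range(n)]
balanced_counts = Counter(labels) + Counter(s["class"] for s in synthetic)
print(quota)
print(balanced_counts)   # every class reaches the size of the largest one
```

In the real pipeline, `fake_gan_sample` would be replaced by a generator trained on the minority classes, but the quota logic, topping each class up to the majority count, is the same.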
This multimodal approach, which combines vision and language, opens a path toward AI systems that explain their reasoning in human terms. Rather than presenting a doctor with a confidence score or a category label, the system could say: "This cell shows characteristics of AML: large immature blood cells with abnormal structure." A clinician can then verify whether that description matches what they see under the microscope.
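One simple way such a system could turn an image into a sentence is nearest-neighbor retrieval over the description embeddings. The sketch below is illustrative only: the descriptions are the ones quoted in this article, but the "embeddings" are random vectors standing in for the outputs of trained image and text encoders.

```python
import numpy as np

# Descriptions quoted in the article; embeddings below are mock stand-ins.
DESCRIPTIONS = {
    "AML": "Large immature blood cells with abnormal structure.",
    "CML": "High number of abnormal white blood cells in the blood.",
    "ALL": "Fast-growing immature lymphocyte cells in the blood.",
    "CLL": "Small abnormal lymphocyte cells with slow progression.",
}

rng = np.random.default_rng(42)
text_embeddings = {cls: rng.normal(size=16) for cls in DESCRIPTIONS}

def describe(image_embedding):
    """Return the description whose embedding best matches the image's."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    best = max(text_embeddings,
               key=lambda cls: cosine(image_embedding, text_embeddings[cls]))
    return f"This cell shows characteristics of {best}: {DESCRIPTIONS[best]}"

# A mock "AML-like" image: its embedding sits near the AML text embedding.
image = text_embeddings["AML"] + rng.normal(scale=0.1, size=16)
print(describe(image))
```

The clinically useful part is the output format: instead of a bare class index, the clinician receives a checkable claim about cell size, maturity, and structure that either matches the microscope view or does not.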
What Are the Real-World Implications for Healthcare AI?
The capability-reliability gap that plagues current AI systems is particularly dangerous in healthcare, where mistakes can have serious consequences. Auditors in Ontario documented cases where AI note-taking systems misrecorded patient information, creating potential safety hazards. These failures occurred not because the AI lacked capability, but because it lacked reliability in domains where accuracy is non-negotiable.
The descriptive sentence approach addresses this problem partially by making AI reasoning transparent. When an AI system must articulate its reasoning in language, it becomes easier for humans to catch errors before they propagate into clinical decisions. However, the researchers acknowledge that their current method is not perfect and requires improvement through larger datasets, better text descriptions, and stronger multimodal models.
The broader lesson is that advancing AI capability alone is insufficient. The field must simultaneously improve reliability, transparency, and the ability of AI systems to explain their decisions in terms that domain experts can understand and verify. For medical imaging, this means moving beyond black-box classification toward systems that reason about images in ways that align with how clinicians think.