Advanced Large Language Models Exhibit 20 Percent Diagnostic Failure Rate in Critical Neurological Imaging Study
Study finds GPT-5 and Gemini 3 Pro make critical errors in CT scan analysis. Experts warn against using conversational AI for medical diagnosis without oversight.
By: AXL Media
Published: Mar 17, 2026, 4:18 AM EDT
Source: New York Institute of Technology

The Growing Divide in Clinical Artificial Intelligence
As specialized artificial intelligence continues to integrate into modern healthcare, a critical distinction has emerged between task-specific algorithms and general-purpose large language models. While specialized systems are already deployed in hospitals to flag retinal disease or early-stage lung cancer, the reliability of broader AI platforms such as ChatGPT and Claude remains a subject of intense scrutiny. A new study published in the journal Algorithms highlights the limitations of these multimodal systems when they are tasked with interpreting complex medical imagery. The researchers found that while these models excel at language, their lack of optimization for specific diagnostic tasks produces a margin of error large enough to compromise patient safety in a real-world clinical setting.
A Comparative Analysis of Leading AI Models
The research team, led by Associate Professor Milan Toma of the New York Institute of Technology, subjected five of the world's most advanced AI models to a rigorous diagnostic test. GPT-5, Gemini 3 Pro, Llama 4 Maverick, Grok 4, and Claude Opus 4.5 Extended were each presented with the same CT brain scan exhibiting clear intracranial pathology. The models were instructed to perform the role of a radiologist: identify the imaging technique, the location of the abnormality, and the primary diagnosis. Although the models initially appeared promising, all correctly identifying the scan type, the subsequent analysis revealed a 20 percent failure rate on the primary diagnosis: one of the five models misdiagnosed the case, exposing a lack of reliability in high-stakes medical assessments.
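The headline figure follows directly from the study's design: five models, one critical misdiagnosis. A minimal sketch of that tally, noting that the article does not specify which model erred, so the individual verdicts below are placeholders and only the count matters:

```python
# Five models were evaluated; the article reports one catastrophic
# misdiagnosis. Which model failed is a placeholder assumption here.
verdicts = [True, True, True, True, False]  # True = correct diagnosis

failures = verdicts.count(False)
rate = failures / len(verdicts) * 100

print(f"{failures} of {len(verdicts)} models failed -> {rate:.0f}% failure rate")
# prints "1 of 5 models failed -> 20% failure rate"
```

In other words, a single error among five models is what the article reports as a 20 percent failure rate.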
Critical Errors in Stroke Classification
The most alarming finding of the study involved the misclassification of an acute ischemic stroke. While four of the five models correctly identified the location of the blockage near the left middle cerebral artery, one model committed a catastrophic error, identifying the issue as a hemorrhage on the opposite side of the brain. Dr. Toma emphasized that such an error in a clinical environment would have devastating consequences, as ischemic and hemorrhagic strokes require fundamentally different and often contradictory treatment protocols. This inconsistency highlights the danger of relying on systems that prioritize authoritative-sounding explanations over verified accuracy.
Related Coverage
- Frontier AI Models Invent Medical Details for X-Rays They Have Never Seen
- St. Jude Researchers Leverage Complex AI Prompting to Detect High-Risk Symptoms in Childhood Cancer Survivors
- Penn State Study Finds Jurors Nearly Fifty Percent More Likely to Penalize Physicians Who Disregard Correct AI Diagnostics
- Stanford Medicine Research Proves AI Chatbots Outperform Doctors in Clinical Management and Treatment Decisions