Advanced Large Language Models Exhibit 20 Percent Diagnostic Failure Rate in Critical Neurological Imaging Study

Study finds GPT-5 and Gemini 3 Pro make critical errors in CT scan analysis. Experts warn against using conversational AI for medical diagnosis without oversight.

By: AXL Media

Published: Mar 17, 2026, 4:18 AM EDT

Source: New York Institute of Technology


The Growing Divide in Clinical Artificial Intelligence

As specialized artificial intelligence continues to integrate into modern healthcare, a critical distinction has emerged between task-specific algorithms and general-purpose large language models. While specialized systems are currently utilized in hospitals to flag retinal disease or early-stage lung cancer, the reliability of broader AI platforms like ChatGPT and Claude remains a subject of intense scrutiny. A new investigative study published in the journal Algorithms has highlighted the limitations of these multimodal systems when tasked with interpreting complex medical imagery. Researchers found that while these models are adept at linguistics, their lack of optimization for specific diagnostic tasks leads to a significant margin of error that could compromise patient safety in a real-world clinical setting.

A Comparative Analysis of Leading AI Models

The research team, led by Associate Professor Milan Toma from the New York Institute of Technology, subjected five of the world's most advanced AI models to a rigorous diagnostic test. GPT-5, Gemini 3 Pro, Llama 4 Maverick, Grok4, and Claude Opus 4.5 Extended were all presented with the same CT brain scan exhibiting clear intracranial pathology. The models were instructed to perform the role of a radiologist, identifying the imaging technique, the location of the abnormality, and the primary diagnosis. Although the models initially appeared promising by correctly identifying the scan type, the subsequent analysis revealed a 20 percent failure rate, with one of the five models producing a critical misdiagnosis, exposing a lack of reliability in high-stakes medical assessments.

Critical Errors in Stroke Classification

The most alarming finding of the study involved the misclassification of an acute ischemic stroke. While four of the five models correctly identified the location of the blockage near the left middle cerebral artery, one model committed a catastrophic error, identifying the issue as a hemorrhage on the opposite side of the brain. Dr. Toma emphasized that such an error in a clinical environment would have devastating consequences, as ischemic and hemorrhagic strokes require fundamentally different and often contradictory treatment protocols. This inconsistency highlights the danger of relying on systems that prioritize authoritative-sounding explanations over verified accuracy.
