Medical AI flaws exposed as models diagnose without images

Researchers have found that artificial intelligence (AI) used in medical settings can produce answers “as if it directly saw” an image, even when it never received an actual diagnostic image.

A Live Science report cited by online outlet Gigazine on April 13 local time said a research team led by Mohammad Asadi at Stanford University reported the problem in several visual AI models, including in healthcare.

The team gave AI models prompts about tissue samples, chest X-rays and brain MRI scans, then compared cases where real images were provided with cases where they were not. The study covered 12 AI models. When images were absent, many models described images that did not exist and then offered a diagnosis or answer instead of saying, “There is no image.”

The phenomenon was particularly pronounced in healthcare. For pathology-image queries, AI outputs tended to skew toward severe diagnoses that would require additional clinical action. The team named the tendency for models to act as if they had checked images even when none were provided “mirage reasoning.”

The problem is that such models can still score highly in existing performance evaluations. The team said there were cases in which a model ranked top on a chest X-ray question-and-answer benchmark even when it answered without images. That means it is difficult to conclude an AI truly understood images simply because it scored well on existing benchmarks.

Evaluation results also varied widely depending on how questions were phrased. The team explained that scores rose when the AI was instructed to “assume there is an image and answer,” and fell sharply when it was explicitly told, “There is no image, so guess.” This suggests that although some models recognize the absence of images and respond cautiously, there is also a tendency to answer on the premise that images exist even when they do not.

The team proposed “B-Clean” as an evaluation approach to reduce such limitations. The method removes items that can be solved without images or are easy to infer from the question text alone, leaving only items that require seeing the actual image.
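The filtering idea described above can be illustrated with a minimal sketch. It is not the team's implementation: the `Item` record, the `correct_without_image` field, the model names and the rule "drop an item if any model answers it correctly without the image" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One benchmark question plus hypothetical image-free results."""
    item_id: str
    # Model name -> True if that model answered correctly WITHOUT the image.
    # (Hypothetical field; the article does not describe the data layout.)
    correct_without_image: dict[str, bool] = field(default_factory=dict)


def filter_image_dependent(items: list[Item]) -> list[Item]:
    """Keep only items that no model solved without seeing the image."""
    return [
        item for item in items
        if not any(item.correct_without_image.values())
    ]


# Made-up example data, not results from the study.
items = [
    Item("q1", {"model_a": True, "model_b": False}),   # solvable from text alone
    Item("q2", {"model_a": False, "model_b": False}),  # appears to need the image
    Item("q3", {"model_a": False, "model_b": True}),   # solvable from text alone
]
kept = filter_image_dependent(items)
print([item.item_id for item in kept])  # ['q2']
```

Under this assumed rule, only "q2" survives, because it is the one item that neither hypothetical model could answer from the question text alone.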

When the team applied B-Clean to three benchmarks, “MMMU-Pro,” “MedXpertQA-MM” and “MicroVQA,” the total number of items fell to about one quarter of the original. After the items were filtered, not only accuracy rates but also AI model rankings changed. This raises the possibility that prior rankings were boosted by imagined images rather than real image understanding.
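A small, made-up example shows how rankings can flip once text-solvable items are removed. The model names, the per-item results and the `rank_models` helper are hypothetical and are not taken from the paper; the sketch only illustrates the arithmetic.

```python
def rank_models(results: dict[str, dict[str, bool]], item_ids: set[str]) -> list[str]:
    """Rank models by accuracy on the given items, best first."""
    if not item_ids:
        raise ValueError("item_ids must not be empty")

    def accuracy(model: str) -> float:
        answers = results[model]
        return sum(answers[i] for i in item_ids) / len(item_ids)

    # Sort by descending accuracy; break ties by model name.
    return sorted(results, key=lambda m: (-accuracy(m), m))


# Hypothetical per-item correctness (True = correct answer).
results = {
    "model_a": {"q1": True, "q2": True, "q3": False, "q4": False},
    "model_b": {"q1": False, "q2": False, "q3": True, "q4": False},
}
all_items = {"q1", "q2", "q3", "q4"}
# Assume q1 and q2 turned out to be answerable without any image.
image_dependent = {"q3", "q4"}

print(rank_models(results, all_items))        # ['model_a', 'model_b']
print(rank_models(results, image_dependent))  # ['model_b', 'model_a']
```

In this toy case model_a leads on the full set (2 of 4 correct versus 1 of 4) but falls behind on the filtered set (0 of 2 versus 1 of 2), since all of its correct answers came from items assumed to be solvable without the image.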

The paper is a preprint that has not yet undergone peer review, and it did not directly evaluate every medical AI system used in real clinical settings. Even so, the team pointed out that AI premised on reading medical images can generate plausible diagnostic sentences without images, and that existing benchmarks make it difficult to filter this out sufficiently. The findings point to the need for evaluation systems for multimodal AI deployed in healthcare that can verify not only performance figures but also whether answers are truly grounded in images.

Yoonseo Lee yslee@d-today.co.kr
