Evaluating Hallucination and Diagnostic Reliability of LLMs on Medical Image-Based Multiple Choice Tasks

  • Debapriya Hazra
  • Shayani Mukherjee
  • Suman Kumar
  • Subhajit Chatterjee
  • Prince Waqas Khan
  • Khizar Abbas*

*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review

Abstract

The growing integration of large language models into biomedical and clinical workflows has sparked interest in their potential for diagnostic support, particularly in interpreting complex medical data. However, as these models are increasingly applied to sensitive decision-making tasks, there is a critical need to evaluate not only their accuracy but also the reliability and clinical grounding of their reasoning. Addressing this gap, we present a systematic framework to assess both diagnostic correctness and explanation quality of language models on medical image-based multiple-choice tasks. Our evaluation spans four state-of-the-art models applied to 30 diverse medical cases across 10 clinical specializations and five imaging modalities. Each case includes multiple diagnostic images and is paired with one correct answer and three distractors targeting specific reasoning vulnerabilities (visual similarity, anatomical misplacement, and semantic plausibility). To assess model performance, we introduce a comprehensive set of metrics, including hallucination rate, reasoning score, anatomical correctness, and a grounding deviation score that quantifies the alignment between model-generated explanations and clinically expected imaging features. Results reveal that while some models achieve moderate diagnostic accuracy, they often rely on shallow patterns or hallucinated logic. High deviation scores, even for correct predictions, underscore the disconnect between answer selection and clinical reasoning. Weak and incorrect reasoning remains the most frequent failure mode. These findings emphasize the importance of evaluating how and why models arrive at their answers, rather than focusing solely on whether the answers are correct. Our framework offers critical insights into the reasoning behind model predictions, supporting greater interpretability and safer use of language models in biomedical diagnosis.
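The abstract does not define these metrics formally. The sketch below is a minimal illustration, assuming per-case expert annotations of hallucination and of the imaging features cited versus clinically expected; the `CaseResult` fields, the Jaccard-based `grounding_deviation`, and all names are hypothetical constructions, not the paper's actual implementation. It shows how a model can answer correctly yet score high grounding deviation when the features it cites do not match those a clinician would expect.

```python
# Hypothetical sketch of aggregate evaluation metrics; the paper's actual
# scoring rubric, feature vocabulary, and weighting are not given in the abstract.
from dataclasses import dataclass

@dataclass
class CaseResult:
    correct: bool                 # model chose the right diagnosis
    hallucinated: bool            # explanation asserts findings absent from the image
    reasoning_score: float        # expert rating of explanation quality, 0..1
    cited_features: set[str]      # imaging features the explanation mentions
    expected_features: set[str]   # clinically expected features for this case

def hallucination_rate(results: list[CaseResult]) -> float:
    """Fraction of cases whose explanation contains hallucinated content."""
    return sum(r.hallucinated for r in results) / len(results)

def grounding_deviation(r: CaseResult) -> float:
    """1 minus the Jaccard overlap between cited and expected features.

    0.0 = explanation cites exactly the expected findings;
    1.0 = no overlap with what a clinician would look for.
    """
    union = r.cited_features | r.expected_features
    if not union:
        return 0.0
    overlap = len(r.cited_features & r.expected_features) / len(union)
    return 1.0 - overlap

results = [
    CaseResult(True, False, 0.9, {"ground-glass opacity"}, {"ground-glass opacity"}),
    # Right answer reached through ungrounded reasoning:
    CaseResult(True, True, 0.3, {"pleural effusion"}, {"consolidation"}),
]
print(f"accuracy:                 {sum(r.correct for r in results) / len(results):.2f}")
print(f"hallucination rate:       {hallucination_rate(results):.2f}")
print(f"mean grounding deviation: {sum(map(grounding_deviation, results)) / len(results):.2f}")
```

The second toy case illustrates the abstract's central finding: accuracy alone (1.00 here) hides a 50% hallucination rate and a high deviation score, which is why the framework scores the explanation, not just the chosen answer.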

Original language: English
Number of pages: 8
Journal: IEEE Journal of Biomedical and Health Informatics
DOIs
Publication status: E-pub ahead of print - 15 Oct 2025
Publication type: A1 Journal article-refereed

Keywords

  • Clinical Reasoning
  • Hallucination
  • Large Language Models (LLMs)
  • Medical Imaging

Publication forum classification

  • Publication forum level 2

ASJC Scopus subject areas

  • Computer Science Applications
  • Health Informatics
  • Electrical and Electronic Engineering
  • Health Information Management
