May 27, 2026 · arXiv:2605.28025

MIRA: A Bilingual Benchmark for Medical Information Response Audit

Mengyu Xu, Xiwei Dai, Weiyi Wu, Chongyang Gao

AI Summary

The paper introduces MIRA, a bilingual benchmark that evaluates whether LLMs provide consistent medical information when the same question is phrased with different language, register, and health-literacy signals. The study identifies a pattern termed "Differential Information Dilution" (DID): responses to low health-literacy prompts omit more key information, give fewer concrete next steps, and offer less support for independent judgment. A knowledge-guided mitigation prompt reduces information dilution for most models, with the largest reductions observed for Claude and Qwen.

Key Contribution

When answering medical questions, LLMs consistently dilute information in response to prompts that signal low health literacy, even though they still answer every question posed.

Abstract

Large language models (LLMs) are increasingly used to provide public-facing health information, yet existing safety evaluations overlook whether responses preserve comparable medical information across different user phrasings of the same question. To address this, we introduce the Medical Information Response Audit (MIRA), a bilingual, controlled benchmark that assesses whether LLMs provide comparable medical information across user-side language, register, and health literacy signals. MIRA contains 4,320 prompts built from 60 medically reviewed, low-risk health questions. Across five mainstream LLMs, models answered all medical questions, but responses to low health-literacy signals consistently omitted more key information, provided fewer concrete next steps, and offered less support for independent judgment. We term this pattern Differential Information Dilution (DID). Language effects are model-specific rather than uniformly worse for non-English prompts. A comparison with 300 real-world health queries provides preliminary evidence of rank-order validity. A knowledge-guided mitigation prompt reduces information dilution for most models, with the largest reductions in underinformative simplification observed for Claude (~8%) and Qwen (~6%).
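The abstract does not say how omitted key information is scored. As a rough illustration only, the sketch below assumes a simple phrase-matching coverage score and reports the gap between a baseline response and a low health-literacy variant. The function names (`coverage`, `dilution_gap`), the key points, and the example responses are all hypothetical and are not taken from the paper; only the prompt and question counts come from the abstract.

```python
# Hypothetical sketch: NOT the scoring method used by MIRA.

def coverage(response: str, key_points: list[str]) -> float:
    """Fraction of key points whose phrase appears in the response (case-insensitive)."""
    text = response.lower()
    hits = sum(kp.lower() in text for kp in key_points)
    return hits / len(key_points)


def dilution_gap(baseline: str, variant: str, key_points: list[str]) -> float:
    """Baseline coverage minus variant coverage; positive means the variant omits more."""
    return coverage(baseline, key_points) - coverage(variant, key_points)


# Counts stated in the abstract: 4,320 prompts built from 60 questions.
variants_per_question = 4320 // 60

# Toy data invented for illustration.
key_points = [
    "rest",
    "fluids",
    "see a doctor if fever lasts more than 3 days",
]
baseline = "Get rest, drink fluids, and see a doctor if fever lasts more than 3 days."
low_literacy = "Just rest and drink fluids."

gap = dilution_gap(baseline, low_literacy, key_points)
print(f"variants per question: {variants_per_question}")
print(f"coverage gap: {gap:.2f}")
```

Substring matching is a deliberately crude stand-in; a real audit would need rubric-based or expert-reviewed scoring, since a response can convey a key point without reusing its exact wording.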

Constitutional AI & AI Ethics · Eval Frameworks & Benchmarks · Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 40

Year: 2026

Venue: N/A
