Shanghai AI Lab, Tencent AI, UESTC, ZJU | May 27, 2026 | arXiv:2605.28422

VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs

Qiaoru Li, Shaotian Liang, Jintao Chen, Yuxiang Cai, Yankai Jiang

AI Summary

The paper introduces VITAL, a latent-space reasoning framework for medical MLLMs that uses visual-semantic dual supervision to address modality collapse and improve interpretability. During training, an auxiliary text decoder reconstructs reasoning chains from the latent states and a visual projector regresses ROI features; both modules are discarded at inference, so they add no inference overhead. The authors also construct a 61K dataset spanning 9 imaging modalities. Experiments on 7 benchmarks show that VITAL outperforms its backbone, existing latent reasoning baselines, and medical MLLMs trained on far larger data, achieving state-of-the-art results.

Key Contribution

With visual-semantic dual supervision of an interpretable latent space, a medical MLLM can reach state-of-the-art reasoning performance that is competitive with trillion-parameter proprietary models, with no added inference cost.

Abstract

Latent reasoning enables reasoning over continuous hidden states rather than explicit tokens, avoiding the language bottleneck and inference overhead of chain-of-thought for medical VQA. However, existing methods suffer from modality collapse, insufficient visual supervision, and train-inference mismatch. Moreover, their opaque latent states offer no interpretability, which is critical in clinical applications. We propose VITAL, a latent-space reasoning framework for medical MLLMs with visual-semantic dual supervision: an auxiliary text decoder reconstructs reasoning chains from latent states, while a visual projector regresses ROI features from a frozen, independent medical vision encoder. Both modules are discarded at inference with zero overhead, yet can be re-attached post-hoc for dual interpretability, providing textual and visual explanations of the reasoning process without sacrificing efficiency. We construct a 61K dataset spanning 9 imaging modalities, exceeding prior medical visual latent reasoning datasets by an order of magnitude. Experiments on 7 benchmarks show that VITAL consistently and substantially outperforms the backbone, all latent reasoning baselines, and medical MLLMs trained on far larger data, achieving state-of-the-art results competitive with trillion-parameter proprietary models.
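
The dual-supervision idea described in the abstract can be illustrated with a toy training objective. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the loss form, so the weighted sum, the weights `lambda_text` and `lambda_vis`, the linear auxiliary heads, and the one-token-per-latent-step alignment are all hypothetical choices made for illustration. The ROI features are treated as fixed targets, standing in for the output of the frozen medical vision encoder.

```python
import math


def cross_entropy(logits, target_index):
    """Cross-entropy of a softmax over `logits` against one target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def linear(weights, vec):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def dual_supervision_loss(latents, answer_logits, answer_id,
                          text_decoder_w, chain_token_ids,
                          visual_projector_w, roi_features,
                          lambda_text=1.0, lambda_vis=1.0):
    """Toy objective: answer loss plus two auxiliary losses on the latents.

    Assumptions (not from the paper): both auxiliary heads are linear maps,
    each latent step is paired with one reasoning-chain token and one ROI
    feature vector, and the three terms are combined by a weighted sum.
    """
    # Main task: predict the answer from the model's output logits.
    answer_loss = cross_entropy(answer_logits, answer_id)
    # Semantic branch: auxiliary text decoder reconstructs the reasoning chain.
    text_loss = sum(
        cross_entropy(linear(text_decoder_w, z), tok)
        for z, tok in zip(latents, chain_token_ids)
    ) / len(latents)
    # Visual branch: projector regresses ROI features (fixed targets here).
    visual_loss = sum(
        mse(linear(visual_projector_w, z), roi)
        for z, roi in zip(latents, roi_features)
    ) / len(latents)
    total = answer_loss + lambda_text * text_loss + lambda_vis * visual_loss
    return {"total": total, "answer": answer_loss,
            "text": text_loss, "visual": visual_loss}


# Tiny example: two latent steps of dimension 2, a 3-token vocabulary.
example = dual_supervision_loss(
    latents=[[1.0, 0.0], [0.0, 1.0]],
    answer_logits=[0.0, 0.0],
    answer_id=0,
    text_decoder_w=[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
    chain_token_ids=[0, 2],
    visual_projector_w=[[1.0, 0.0], [0.0, 1.0]],
    roi_features=[[1.0, 0.0], [0.0, 1.0]],
)
```

In this sketch, only `answer_logits` would be needed at inference time; the text decoder and visual projector weights appear solely in the auxiliary terms, which mirrors the abstract's statement that both modules are discarded at inference and can be re-attached afterwards for explanation.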

Interpretability & Mechanistic Interp | Multimodal Models | Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
