The paper introduces VITAL, a latent-space reasoning framework for medical MLLMs that uses visual-semantic dual supervision to address modality collapse and improve interpretability. An auxiliary text decoder reconstructs reasoning chains from the latent states and a visual projector regresses ROI features; both are discarded at inference, so they add no overhead. The authors also construct a 61K dataset spanning 9 imaging modalities. Across 7 benchmarks, VITAL outperforms its backbone, existing latent reasoning methods, and medical MLLMs trained on much larger datasets, achieving state-of-the-art results.
Medical MLLMs can achieve state-of-the-art reasoning performance, rivaling trillion-parameter models, by learning interpretable latent spaces with visual-semantic supervision, even without increasing inference costs.
Latent reasoning enables reasoning over continuous hidden states rather than explicit tokens, avoiding the language bottleneck and inference overhead of chain-of-thought for medical VQA. However, existing methods suffer from modality collapse, insufficient visual supervision, and train-inference mismatch. Moreover, their opaque latent states offer no interpretability, which is critical in clinical applications. We propose VITAL, a latent-space reasoning framework for medical MLLMs with visual-semantic dual supervision: an auxiliary text decoder reconstructs reasoning chains from latent states, while a visual projector regresses ROI features from a frozen, independent medical vision encoder. Both modules are discarded at inference with zero overhead, yet can be re-attached post-hoc for dual interpretability, providing textual and visual explanations of the reasoning process without sacrificing efficiency. We construct a 61K dataset spanning 9 imaging modalities, exceeding prior medical visual latent reasoning datasets by an order of magnitude. Experiments on 7 benchmarks show that VITAL consistently and substantially outperforms the backbone, all latent reasoning baselines, and medical MLLMs trained on far larger data, achieving state-of-the-art results competitive with trillion-parameter proprietary models.
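The dual supervision described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one linear head per auxiliary module, one reasoning-chain token per latent state, and a simple weighted sum of losses. The names `dual_supervision_loss`, `w_text`, `w_vis`, `lam_text`, and `lam_vis` are hypothetical, and the real auxiliary text decoder is presumably a full decoder rather than a single linear layer.

```python
import math

def linear(x, weights):
    """Apply a weight matrix (one row per output dimension) to vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cross_entropy(logits, target_id):
    """Negative log-likelihood of target_id under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - logits[target_id]

def dual_supervision_loss(latents, w_text, w_vis, chain_ids, roi_feats,
                          answer_loss, lam_text=1.0, lam_vis=1.0):
    """Combine the answer loss with the two auxiliary losses (assumed form).

    latents   : latent reasoning states, one vector per step
    w_text    : hypothetical text head, maps a latent to vocabulary logits
    w_vis     : hypothetical visual projector, maps a latent to ROI feature space
    chain_ids : reasoning-chain token id aligned with each latent (assumption)
    roi_feats : target ROI features from the frozen vision encoder
    """
    n = len(latents)
    # Semantic supervision: reconstruct the reasoning chain from the latents.
    text_loss = sum(cross_entropy(linear(z, w_text), tok)
                    for z, tok in zip(latents, chain_ids)) / n
    # Visual supervision: regress the frozen encoder's ROI features.
    vis_loss = sum(mse(linear(z, w_vis), roi)
                   for z, roi in zip(latents, roi_feats)) / n
    total = answer_loss + lam_text * text_loss + lam_vis * vis_loss
    return total, text_loss, vis_loss

# Toy example: 2 latent steps, latent dim 2, vocabulary of 4, ROI feature dim 2.
latents = [[1.0, 0.0], [0.0, 1.0]]
w_text = [[0.0, 0.0] for _ in range(4)]   # untrained head gives uniform logits
w_vis = [[1.0, 0.0], [0.0, 1.0]]          # identity projector
total, text_loss, vis_loss = dual_supervision_loss(
    latents, w_text, w_vis,
    chain_ids=[2, 0],
    roi_feats=[[0.0, 0.0], [0.0, 0.0]],
    answer_loss=0.5,
)
print(total, text_loss, vis_loss)
```

Both heads appear only in the training loss. At inference the model would run without `w_text` and `w_vis`, which is why the auxiliary modules add no overhead, and re-attaching them afterwards is what allows the latent states to be decoded into textual and visual explanations.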