The paper introduces DIVE, a frozen-backbone distillation framework for long-form medical report generation that addresses the uniform treatment of output tokens in existing methods. DIVE combines decisive-token supervision, which upweights the cross-entropy contribution of pathology-related tokens and the EOS event during training, with state-conditioned dynamic steering, which uses hidden-state-dependent adapters to adjust the injected signal as decoding drifts. In experiments on MIMIC-CXR and CheXpert Plus with two medical VLM backbones, DIVE achieves the best BLEU-4, ROUGE-L, and RadGraph F1 in all dataset-backbone settings while remaining competitive on CheXbert F1.
Treating all tokens as equally important during distillation hurts long-form generation; DIVE counters this with decisive-token supervision and state-conditioned dynamic steering.
Distilling demonstration effects into hidden-space interventions offers a lightweight alternative to full finetuning. However, existing multimodal variants are mostly evaluated on short-form tasks, where outputs end after a few tokens. Extending these methods to long-form generation exposes a fundamental yet underexamined limitation: token-level distillation implicitly treats all output tokens as equally informative, but long-form outputs are dominated by high-frequency template and grammatical tokens, while the tokens that actually determine output quality are sparsely distributed. In medical report generation (MRG), two such decisive tokens stand out: pathology-related tokens that determine diagnostic content, and the end-of-sequence (EOS) event that determines termination. Both receive insufficient supervision under uniform cross-entropy, and autoregressive decoding further compounds the problem by drifting away from teacher-forced trajectories. We propose DIVE, a frozen-backbone distillation framework that addresses long-form report generation through two complementary mechanisms matched to these failures. Decisive-token supervision restores supervision balance by upweighting the cross-entropy contribution of pathology-related tokens and the EOS event, ensuring that content fidelity and termination are learned during training rather than imposed at decoding time. State-conditioned dynamic steering replaces fixed open-loop residuals with hidden-state-dependent adapters, allowing the injected signal to adapt as decoding drifts. Experiments on MIMIC-CXR and CheXpert Plus with two medical VLM backbones show that DIVE consistently ranks among the strongest methods across lexical and clinical-proxy metrics. Our method achieves the best BLEU-4, ROUGE-L, and RadGraph F1 in all dataset-backbone settings, while remaining competitive on coarse label-level CheXbert F1.
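The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the weights of 3.0, the normalization by the sum of weights, and the low-rank tanh adapter form are all assumptions chosen for illustration, since the abstract does not specify them. The first function upweights the cross-entropy terms of pathology-related tokens and the EOS token; the last two contrast a fixed open-loop residual with a residual computed from the current hidden state.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one position's vocabulary logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def weighted_cross_entropy(logits_seq, targets, decisive_ids, eos_id,
                           w_decisive=3.0, w_eos=3.0):
    # Token-level cross-entropy in which decisive tokens count more.
    # The weights are illustrative assumptions, not values from the paper.
    total, norm = 0.0, 0.0
    for logits, t in zip(logits_seq, targets):
        if t == eos_id:
            w = w_eos            # termination event
        elif t in decisive_ids:
            w = w_decisive       # pathology-related token
        else:
            w = 1.0              # template or grammatical token
        total += -w * log_softmax(logits)[t]
        norm += w
    # Normalizing by the weight sum is an assumption of this sketch.
    return total / norm

def fixed_steering(h, v):
    # Open-loop intervention: the same residual v is added to every state h.
    return [hi + vi for hi, vi in zip(h, v)]

def state_conditioned_steering(h, down, up):
    # Hypothetical adapter: residual = up @ tanh(down @ h), so the injected
    # signal depends on the hidden state it is added to.
    # down has shape (r, d) and up has shape (d, r), as lists of rows.
    z = [math.tanh(sum(w * hi for w, hi in zip(row, h))) for row in down]
    delta = [sum(w * zi for w, zi in zip(row, z)) for row in up]
    return [hi + di for hi, di in zip(h, delta)]

if __name__ == "__main__":
    # Toy vocabulary: 0 = filler, 1 = pathology token, 2 = EOS.
    logits_seq = [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]  # second position is wrong
    targets = [0, 1]
    uniform = weighted_cross_entropy(logits_seq, targets, {1}, 2, 1.0, 1.0)
    weighted = weighted_cross_entropy(logits_seq, targets, {1}, 2)
    print(uniform, weighted)
```

In the toy run, the model predicts the filler token well and misses the pathology token, so the weighted loss comes out larger than the uniform one: the error on the decisive token dominates instead of being averaged away by easy tokens. In the steering functions, `fixed_steering` adds the same vector whatever the state, while `state_conditioned_steering` produces a different residual for each hidden state, which is the property the abstract relies on when decoding drifts from teacher-forced trajectories.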