The paper introduces ECHO, a diffusion-based vision-language model for chest X-ray report generation that achieves significant inference speedups. ECHO uses a Direct Conditional Distillation (DCD) framework to enable stable one-step-per-block inference by mitigating the mean-field limitation of token-factorized denoisers. The model also employs a Response-Asymmetric Diffusion (RAD) training strategy to improve training efficiency. ECHO achieves an 8x inference speedup and outperforms state-of-the-art autoregressive methods on RaTE and SemScore without sacrificing clinical accuracy.
Achieve an 8x speedup in chest X-ray report generation without sacrificing clinical accuracy by distilling multi-step diffusion into a single, efficient step.
Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58%, respectively, while achieving an 8× inference speedup without compromising clinical accuracy.
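The mean-field bias mentioned in the abstract can be illustrated with a toy example. The sketch below is not ECHO's method or code. It only shows why sampling every token independently from per-position marginals (which is what a token-factorized denoiser does in a single step) can produce incoherent text. The two-phrase vocabulary and its probabilities are hypothetical, chosen purely for illustration.

```python
from itertools import product

# Hypothetical "true" joint distribution over two-token phrases.
joint = {
    ("pleural", "effusion"): 0.5,
    ("cardiac", "silhouette"): 0.5,
}

def marginals(joint):
    """Per-position marginals: all a token-factorized denoiser can express."""
    length = len(next(iter(joint)))
    margs = [dict() for _ in range(length)]
    for seq, p in joint.items():
        for i, tok in enumerate(seq):
            margs[i][tok] = margs[i].get(tok, 0.0) + p
    return margs

def factorized(margs):
    """Product of marginals: the distribution obtained when each position
    is sampled independently in one step."""
    out = {}
    for combo in product(*(m.items() for m in margs)):
        seq = tuple(tok for tok, _ in combo)
        p = 1.0
        for _, q in combo:
            p *= q
        out[seq] = p
    return out

mean_field = factorized(marginals(joint))

# Probability mass assigned to phrases that never occur under the joint,
# e.g. ("pleural", "silhouette").
incoherent_mass = sum(p for seq, p in mean_field.items() if seq not in joint)

for seq, p in sorted(mean_field.items()):
    print(seq, p)
print("mass on incoherent phrases:", incoherent_mass)
```

In this toy setting, half of the probability mass lands on phrases that the joint distribution never produces, even though every per-position marginal is exactly right. Multi-step denoising avoids this by conditioning later tokens on earlier decisions. According to the abstract, DCD addresses the same gap for one-step inference by supervising with unfactorized targets built from on-policy diffusion trajectories.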