MIT CSAIL; AI for Responsible; Beth Israel Deaconess Medical Center; Cavendish University; Federal University of São Paulo; Harvard; JHU; Makerere University; Mbarara University of Science and Technology; Technical University of Applied Sciences Lübeck (TH Lübeck); Uganda Cancer Institute; UNC

Apr 14, 2026

arXiv:2604.12152

Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution

S. Cajas, Ashaba Judith, R. Gorijavolu, Sahil Kapadia, Hillary Clinton Kasimbazi, L. Kinyera, Emmanuel Kwesiga, Sri Sri Jaithra Varma Manthena, L. Nakayama, Ninsiima Doreen, L. Celi

AI Summary

This paper investigates how the VAE component affects the performance of latent diffusion models for medical image super-resolution. By swapping the standard Stable Diffusion VAE for the domain-specific MedVAE, the authors demonstrate a PSNR improvement of +2.91 to +3.29 dB across knee MRI, brain MRI, and chest X-ray. The study further localises the improvement to the finest spatial frequency bands and shows that VAE reconstruction quality predicts downstream super-resolution performance (R^2 = 0.67), while hallucination rates remain comparable between the two autoencoders.

Key Contribution

Swapping in a domain-specific VAE raises medical image super-resolution fidelity by roughly 3 dB PSNR (+2.91 to +3.29 dB) with no change to the diffusion architecture.

Abstract

Latent diffusion models for medical image super-resolution universally inherit variational autoencoders designed for natural photographs. We show that this default choice, not the diffusion architecture, is the dominant constraint on reconstruction quality. In a controlled experiment holding all other pipeline components fixed, replacing the generic Stable Diffusion VAE with MedVAE, a domain-specific autoencoder pretrained on more than 1.6 million medical images, yields +2.91 to +3.29 dB PSNR improvement across knee MRI, brain MRI, and chest X-ray (n = 1,820; Cohen's d = 1.37 to 1.86, all p < 10^{-20}, Wilcoxon signed-rank). Wavelet decomposition localises the advantage to the finest spatial frequency bands encoding anatomically relevant fine structure. Ablations across inference schedules, prediction targets, and generative architectures confirm the gap is stable within ±0.15 dB, while hallucination rates remain comparable between methods (Cohen's h < 0.02 across all datasets), establishing that reconstruction fidelity and generative hallucination are governed by independent pipeline components. These results provide a practical screening criterion: autoencoder reconstruction quality, measurable without diffusion training, predicts downstream SR performance (R^2 = 0.67), suggesting that domain-specific VAE selection should precede diffusion architecture search. Code and trained model weights are publicly available at https://github.com/sebasmos/latent-sr.
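The quantities reported in the abstract (PSNR in dB, Cohen's d for paired scores, Cohen's h for two rates) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `psnr`, `paired_cohens_d`, and `cohens_h` are hypothetical, the inputs are toy values rather than results from the paper, and the paired effect size assumes the common "mean difference over standard deviation of differences" convention, which the abstract does not specify.

```python
import math
from statistics import mean, stdev


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences.

    data_range is the maximum possible pixel value (1.0 for images scaled to [0, 1]).
    """
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    mse = mean([(r - e) ** 2 for r, e in zip(reference, estimate)])
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


def paired_cohens_d(scores_a, scores_b):
    """Paired effect size: mean of per-image differences over their sample SD.

    Assumption: this is the d_z convention; other paired variants exist.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / stdev(diffs)


def cohens_h(p1, p2):
    """Cohen's h effect size between two proportions in [0, 1]."""
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


if __name__ == "__main__":
    # Toy 4-pixel "images" scaled to [0, 1]; purely illustrative.
    reference = [0.0, 0.5, 1.0, 0.5]
    estimate = [0.1, 0.5, 0.9, 0.5]
    print(f"PSNR: {psnr(reference, estimate):.2f} dB")

    # Toy per-image PSNR scores for two hypothetical pipelines.
    pipeline_a = [30.0, 31.0, 32.0]
    pipeline_b = [28.0, 28.0, 28.0]
    print(f"paired Cohen's d: {paired_cohens_d(pipeline_a, pipeline_b):.2f}")

    # Toy hallucination rates for the same two pipelines.
    print(f"Cohen's h: {cohens_h(0.10, 0.09):.3f}")
```

A significance test such as the Wilcoxon signed-rank test would be applied to the same per-image score pairs; it is omitted here to keep the sketch free of third-party dependencies.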

Computer Vision; Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 58

Year: 2026

Venue: N/A
