This paper introduces a latent denoising framework to improve visual feature alignment in Large Multimodal Models (LMMs) by corrupting visual tokens with saliency-aware noise and training the LMM to recover clean teacher patch features. The framework also incorporates intra-image contrastive patch distillation to prevent representation collapse. Experiments across various multimodal benchmarks demonstrate that this approach enhances visual understanding, reasoning, and compositional robustness, while also improving resilience to common image corruptions without increasing inference cost.
LMMs can gain surprising robustness and visual understanding by learning to denoise corrupted visual tokens, even without extra inference overhead.
Large Multimodal Models (LMMs) such as LLaVA are typically trained with an autoregressive language modeling objective, providing only indirect supervision to visual tokens. This often yields weak internal visual representations and brittle behavior under distribution shift. Inspired by recent progress on latent denoising for learning high-quality visual tokenizers, we show that the same principle provides an effective form of visual supervision for improving internal visual feature alignment and multimodal understanding in LMMs. We propose a latent denoising framework that corrupts projected visual tokens using a saliency-aware mixture of masking and Gaussian noising. The LMM is trained to denoise these corrupted tokens by recovering clean teacher patch features from hidden states at a selected intermediate LLM layer using a decoder. To prevent representation collapse, our framework also preserves the teacher's intra-image similarity structure and applies intra-image contrastive patch distillation. During inference, corruption and auxiliary heads are disabled, introducing no additional inference-time overhead. Across a broad suite of standard multimodal benchmarks, our method consistently improves visual understanding and reasoning over strong baselines, and yields clear gains on compositional robustness benchmarks (e.g., NaturalBench). Moreover, under ImageNet-C-style non-adversarial common corruptions applied to benchmark images, our method maintains higher accuracy and exhibits reduced degradation at both moderate and severe corruption levels. Our code is available at https://github.com/dhruvashp/latent-denoising-for-lmms.
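The training signal described in the abstract — saliency-aware corruption of visual tokens, a denoising objective against clean teacher patch features, and intra-image contrastive patch distillation — can be sketched as follows. This is a minimal illustration under assumed conventions: the function names, the zero-vector stand-in for a [MASK] embedding, the cosine form of the denoising loss, and the hyperparameters (`mask_frac`, `noise_std`, `temp`) are all assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def _l2norm(x):
    """Row-wise L2 normalization."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def saliency_aware_corrupt(tokens, saliency, mask_frac=0.3, noise_std=0.5, rng=None):
    """Corrupt projected visual tokens (N, D): add Gaussian noise to all
    patches, then replace the top-`mask_frac` most salient patches with a
    zero vector standing in for a learned [MASK] embedding.
    Hyperparameters are illustrative, not taken from the paper."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, _ = tokens.shape
    k = max(1, int(mask_frac * n))
    masked = np.argsort(saliency)[-k:]  # indices of the most salient patches
    corrupted = tokens + noise_std * rng.standard_normal(tokens.shape)
    corrupted[masked] = 0.0             # zero vector stands in for [MASK]
    return corrupted, masked

def denoising_loss(decoded, teacher):
    """Denoising objective (assumed cosine form): 1 minus the mean cosine
    similarity between decoded student patch features and clean teacher
    patch features."""
    return 1.0 - np.mean(np.sum(_l2norm(decoded) * _l2norm(teacher), axis=-1))

def intra_image_contrastive_loss(student, teacher, temp=0.1):
    """Intra-image contrastive patch distillation (InfoNCE-style sketch):
    each student patch must match its own teacher patch against the other
    patches of the same image, discouraging representation collapse."""
    logits = _l2norm(student) @ _l2norm(teacher).T / temp  # (N, N) patch similarities
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                    # diagonal = positives
```

Because the corruption and both auxiliary losses act only during training, dropping them at inference recovers the unmodified LMM forward pass, which is why the method adds no inference-time overhead.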