Jun 1, 2026 | arXiv:2606.02453

Initialization is Half the Battle: Generating Diverse Images from a Guidance Potential Posterior

AI Summary

This paper addresses the issue of mode collapse in generative models by introducing a novel initialization strategy called Diversity-inducing Initialization (DivIn), which selects initial noise based on a guidance potential posterior. By leveraging Langevin dynamics, DivIn effectively navigates the initialization landscape to steer away from regions prone to collapse while remaining anchored to the valid data manifold. Experimental results demonstrate that DivIn significantly enhances diversity in both class-to-image and text-to-image generation tasks, outperforming existing methods and expanding the diversity-quality trade-off when combined with trajectory-based approaches.

Key Contribution

Standard Gaussian initialization can cause generation trajectories to collapse into dominant modes; sampling the initial noise from a guidance potential posterior instead improves image diversity.

Abstract

Despite their remarkable fidelity, generative models frequently suffer from mode collapse. Existing strategies for enhancing diversity predominantly intervene during the generation trajectory. We identify a critical oversight: standard Gaussian initialization often causes trajectories to collapse into dominant modes because it is agnostic to the guidance potential landscape. In this work, we formulate initial-noise selection as sampling from a guidance potential posterior, which re-weights the prior towards diversity-rich regions. To sample from this distribution efficiently, we introduce Diversity-inducing Initialization (DivIn), which leverages Langevin dynamics to actively navigate the initialization landscape, steering the initial noise away from collapsing regions while anchoring it to the valid data manifold. Our method serves as an inference-time diversity enhancement compatible with both diffusion and flow matching models. Extensive experiments show that DivIn achieves superior performance in both class-to-image and text-to-image scenarios. Furthermore, because DivIn is orthogonal to trajectory-based methods, combining the two expands the diversity-quality Pareto frontier beyond what either achieves in isolation.
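To make the idea of Langevin sampling from a re-weighted prior concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not specify the guidance potential, so the code assumes a posterior of the form p(z) proportional to N(z; 0, I) * exp(-U(z)) and uses a hypothetical 2-D potential U that is high around a single "collapse-prone" point `center`. The names `grad_log_posterior`, `langevin_init`, `strength`, `width`, and `step` are all illustrative.

```python
import math
import random


def grad_log_posterior(z, center, strength=4.0, width=1.0):
    """Score of p(z) proportional to N(z; 0, I) * exp(-U(z)).

    ASSUMPTION: U(z) = strength * exp(-||z - center||^2 / (2 * width^2)) is a
    toy stand-in for a guidance potential, not the one used in the paper.
    """
    sq_dist = sum((zi - ci) ** 2 for zi, ci in zip(z, center))
    bump = strength * math.exp(-sq_dist / (2.0 * width ** 2))
    # The Gaussian prior contributes -z (pulls toward the origin);
    # -grad U contributes a push away from `center`.
    return [-zi + bump * (zi - ci) / width ** 2 for zi, ci in zip(z, center)]


def langevin_init(z0, center, step=0.05, n_steps=200, rng=None):
    """Run unadjusted Langevin dynamics starting from a Gaussian draw z0."""
    rng = rng or random.Random(0)
    z = list(z0)
    noise_scale = math.sqrt(2.0 * step)
    for _ in range(n_steps):
        score = grad_log_posterior(z, center)
        z = [zi + step * gi + noise_scale * rng.gauss(0.0, 1.0)
             for zi, gi in zip(z, score)]
    return z


def mean_dist(samples, center):
    """Average Euclidean distance of the samples from `center`."""
    return sum(math.dist(s, center) for s in samples) / len(samples)


rng = random.Random(0)
center = [0.0, 0.0]  # hypothetical collapse-prone region of the noise space
prior = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(300)]
posterior = [langevin_init(z, center, rng=rng) for z in prior]

print("mean distance from center, prior:    ", round(mean_dist(prior, center), 3))
print("mean distance from center, posterior:", round(mean_dist(posterior, center), 3))
```

In this toy setting the Gaussian term keeps samples from drifting arbitrarily far, while the potential term moves them out of the high-potential region. In an actual diffusion or flow matching model, the gradient of the potential would come from the model's guidance signal rather than a closed-form bump.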

Computer Vision Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
