The paper introduces Geometry-Aware Spherical Sampling (GASS), a novel method to improve diversity in text-to-image generation by explicitly controlling prompt-dependent and prompt-independent variations in CLIP embedding space. GASS decomposes diversity into orthogonal directions representing semantic variation related to the prompt and prompt-independent variation, then increases the geometric projection spread of generated image embeddings along both axes. Experiments on U-Net and DiT backbones demonstrate that GASS enhances diversity with minimal impact on image fidelity and semantic alignment.
Text-to-image models get a diversity boost with minimal loss of quality, thanks to a new sampling method that disentangles and controls prompt-dependent and prompt-independent variation in the CLIP embedding space.
Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. This lack of diversity not only restricts user choice, but also risks amplifying societal biases. In this work, we enhance T2I diversity through a geometric lens. Unlike most existing methods, which rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embeddings using two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.
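The two-axis decomposition described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`orthogonal_axis`, `projection_spread`), the use of a caller-supplied `candidate` vector with Gram-Schmidt to obtain the orthogonal direction, and the population standard deviation as the measure of "projection spread" are all assumptions made for illustration. The abstract does not state how the orthogonal direction is identified or how spread is computed.

```python
import math
from statistics import pstdev


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale v to unit length (places it on the unit sphere).

    Raises ZeroDivisionError for a zero vector.
    """
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def orthogonal_axis(text_emb, candidate):
    """Hypothetical helper: remove the text component from `candidate`
    (one Gram-Schmidt step) so the result is orthogonal to the text axis.
    `candidate` must not be parallel to `text_emb`."""
    t = unit(text_emb)
    c = dot(candidate, t)
    return unit([x - c * ti for x, ti in zip(candidate, t)])


def projection_spread(image_embs, text_emb, candidate):
    """Return (spread along text axis, spread along orthogonal axis).

    Spread is taken here to be the population standard deviation of the
    projections of unit-normalized image embeddings; this choice is an
    assumption, not the paper's definition.
    """
    t = unit(text_emb)
    u = orthogonal_axis(text_emb, candidate)
    zs = [unit(z) for z in image_embs]
    along_text = [dot(z, t) for z in zs]   # prompt-dependent axis
    along_orth = [dot(z, u) for z in zs]   # prompt-independent axis
    return pstdev(along_text), pstdev(along_orth)


# Toy 3-D "embeddings": the two images agree along the text axis
# and differ only along the orthogonal axis.
text = [1.0, 0.0, 0.0]
candidate = [1.0, 1.0, 0.0]
images = [[1.0, 1.0, 0.0], [1.0, -1.0, 0.0]]
text_spread, orth_spread = projection_spread(images, text, candidate)
```

In this toy case the spread along the text axis is zero while the spread along the orthogonal axis is not, which is the kind of separation the decomposition is meant to expose. A diversity-enhancing sampler in the spirit of GASS would aim to increase both numbers; how the paper turns this into guidance during sampling is not specified in the abstract and is not sketched here.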