The paper introduces UniRect-CoT, a training-free framework that enhances the generation capabilities of Unified Multimodal Models (UMMs) by leveraging their inherent understanding. It addresses the capability mismatch between understanding and generation in UMMs by drawing inspiration from the "Thinking-While-Drawing" paradigm. UniRect-CoT aligns intermediate diffusion denoising steps with the target instruction, using this alignment as a self-supervisory signal to rectify UMM generation, leading to significant improvements in generation quality across various tasks.
Unlock a UMM's hidden potential: a training-free method uses the model's own understanding to guide and improve its image generation.
Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch: their understanding capability significantly outperforms their generation capability. This mismatch indicates that the model's rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human "Thinking-While-Drawing" paradigm, where humans continuously reflect to activate their knowledge and rectify intermediate results. In this paper, we propose UniRect-CoT, a training-free unified rectification chain-of-thought framework. Our approach unlocks the "free lunch" hidden in the UMM's powerful inherent understanding to continuously reflect, activating its internal knowledge and rectifying intermediate results during generation. We regard the diffusion denoising process in UMMs as an intrinsic visual reasoning process and align the intermediate results with the target instruction understood by the model, serving as a self-supervisory signal to rectify UMM generation. Extensive experiments demonstrate that UniRect-CoT can be easily integrated into existing UMMs, significantly enhancing generation quality across diverse complex tasks.
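The core loop described above — treat each intermediate denoising result as something the model can "understand," score its alignment with the target instruction, and use that score to rectify the next step — can be sketched in miniature. The sketch below is a toy illustration under stated assumptions, not the paper's actual implementation: `understanding_score` stands in for the UMM's understanding branch, the "image" is a small vector, and the rectification is a simple gradient nudge on the intermediate estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def understanding_score(x, target):
    """Stand-in for the UMM's understanding of how well x matches the
    instruction: higher is better (here, negative squared error)."""
    return -np.sum((x - target) ** 2)

def score_gradient(x, target):
    """Analytic gradient of the toy score above with respect to x."""
    return -2.0 * (x - target)

def denoise(x_noisy, base_pred, steps=50, target=None, guidance=0.05):
    """Toy iterative denoiser. Without `target`, it only pulls the sample
    toward the generator's own prediction. With `target`, each intermediate
    result is additionally rectified toward a higher understanding score,
    mimicking the self-supervisory rectification signal."""
    x = x_noisy.copy()
    for _ in range(steps):
        x = x + 0.1 * (base_pred - x)  # plain denoising pull
        if target is not None:
            # rectification: nudge intermediate result along the score gradient
            x = x + guidance * score_gradient(x, target)
    return x

# "Instruction" the understanding branch encodes, vs. what generation
# alone would produce (deliberately misaligned to show the effect).
target = np.ones(8)
base_pred = 0.5 * np.ones(8)
x_noisy = base_pred + rng.normal(scale=1.0, size=8)

plain = denoise(x_noisy, base_pred)
rectified = denoise(x_noisy, base_pred, target=target)
```

In this toy setting the rectified trajectory converges between the generator's own prediction and the instruction target, so its understanding score ends up strictly higher than the unguided run — the same qualitative effect the framework aims for, with the UMM's real understanding module in place of the hand-written score.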