The paper introduces UniRect-CoT, a training-free framework that enhances the generation capabilities of Unified Multimodal Models (UMMs) by leveraging their inherent understanding. It addresses the capability mismatch between understanding and generation in UMMs by drawing inspiration from the "Thinking-While-Drawing" paradigm. UniRect-CoT aligns intermediate diffusion denoising steps with the target instruction, using this alignment as a self-supervisory signal to rectify UMM generation, leading to significant improvements in generation quality across various tasks.
Unlock a UMM's hidden potential: a training-free method uses the model's own understanding to guide and improve its image generation.
Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch: their understanding capability significantly outperforms their generation capability. This mismatch indicates that the model's rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human "Thinking-While-Drawing" paradigm, where humans continuously reflect to activate their knowledge and rectify intermediate results. In this paper, we propose UniRect-CoT, a training-free unified rectification chain-of-thought framework. Our approach unlocks the "free lunch" hidden in the UMM's powerful inherent understanding to continuously reflect, activating its internal knowledge and rectifying intermediate results during generation. We regard the diffusion denoising process in UMMs as an intrinsic visual reasoning process and align the intermediate results with the target instruction understood by the model, serving as a self-supervisory signal to rectify UMM generation. Extensive experiments demonstrate that UniRect-CoT can be easily integrated into existing UMMs, significantly enhancing generation quality across diverse complex tasks.
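The core loop described above — treat each intermediate denoising result as something the model can "understand," score its alignment with the target instruction, and use that score to rectify the next step — can be sketched in miniature. The sketch below is a toy illustration under stated assumptions, not the paper's actual implementation: `understanding_score` stands in for the UMM's understanding branch, the "image" is a small vector, and the rectification is a simple gradient nudge on the intermediate estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def understanding_score(x, target):
    """Stand-in for the UMM's understanding of how well x matches the
    instruction: higher is better (here, negative squared error)."""
    return -np.sum((x - target) ** 2)

def score_gradient(x, target):
    """Analytic gradient of the toy score above with respect to x."""
    return -2.0 * (x - target)

def denoise(x_noisy, base_pred, steps=50, target=None, guidance=0.05):
    """Toy iterative denoiser. Without `target`, it only pulls the sample
    toward the generator's own prediction. With `target`, each intermediate
    result is additionally rectified toward a higher understanding score,
    mimicking the self-supervisory rectification signal."""
    x = x_noisy.copy()
    for _ in range(steps):
        x = x + 0.1 * (base_pred - x)  # plain denoising pull
        if target is not None:
            # rectification: nudge intermediate result along the score gradient
            x = x + guidance * score_gradient(x, target)
    return x

# "Instruction" the understanding branch encodes, vs. what generation
# alone would produce (deliberately misaligned to show the effect).
target = np.ones(8)
base_pred = 0.5 * np.ones(8)
x_noisy = base_pred + rng.normal(scale=1.0, size=8)

plain = denoise(x_noisy, base_pred)
rectified = denoise(x_noisy, base_pred, target=target)
```

In this toy setting the rectified trajectory converges between the generator's own prediction and the instruction target, so its understanding score ends up strictly higher than the unguided run — the same qualitative effect the framework aims for, with the UMM's real understanding module in place of the hand-written score.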