PKU · Shanghai AI Lab · Feb 27, 2026 · arXiv:2602.23996

Accelerating Masked Image Generation by Learning Latent Controlled Dynamics

Kaiwen Zhu, Quansheng Zeng, Yuandong Pu, Shuo Cao, Xiaohui Li, Yi Xin, Qi Qin, Jiayang Li, Yu Qiao, Yihao Liu

AI Summary

The paper introduces MIGM-Shortcut, a method to accelerate Masked Image Generation Models (MIGMs) by learning a lightweight model that predicts the average velocity field of feature evolution based on previous features and sampled tokens. This approach addresses the redundancy in MIGM computation caused by discarding continuous features during discrete token sampling and the limitations of existing feature caching methods. Experiments on Lumina-DiMOO demonstrate over 4x acceleration in text-to-image generation while maintaining image quality, improving the efficiency of MIGMs.

Key Contribution

MIGM-Shortcut delivers over 4x faster text-to-image generation from masked image models such as Lumina-DiMOO while maintaining image quality, by predicting how features evolve instead of recomputing them with the full base model at every step.

Abstract

Masked Image Generation Models (MIGMs) have achieved great success, yet their efficiency is hampered by the multiple steps of bi-directional attention. In fact, there exists notable redundancy in their computation: when sampling discrete tokens, the rich semantics contained in the continuous features are lost. Some existing works attempt to cache the features to approximate future features. However, they exhibit considerable approximation error under aggressive acceleration rates. We attribute this to their limited expressivity and their failure to account for sampling information. To fill this gap, we propose to learn a lightweight model that incorporates both previous features and sampled tokens, and regresses the average velocity field of feature evolution. The model has moderate complexity that suffices to capture the subtle dynamics while remaining lightweight compared to the original base model. We apply our method, MIGM-Shortcut, to two representative MIGM architectures and tasks. In particular, on the state-of-the-art Lumina-DiMOO, it achieves over 4x acceleration of text-to-image generation while maintaining quality, significantly pushing the Pareto frontier of masked image generation. The code and model weights are available at https://github.com/Kaiwen-Zhu/MIGM-Shortcut.
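To make the idea in the abstract concrete, here is a minimal toy sketch of regressing an average velocity and using it to advance features without calling the base model. Everything in it is an assumption for illustration: the function names, the flat list-of-floats feature representation, the linear predictor, and the update rule `f + dt * v` are hypothetical and are not taken from the MIGM-Shortcut code.

```python
# Toy illustration only. All names and the linear predictor are hypothetical
# assumptions; this is NOT the MIGM-Shortcut implementation.

def average_velocity_target(feat_start, feat_end, dt):
    """Regression target: average rate of feature change over an interval dt."""
    return [(b - a) / dt for a, b in zip(feat_start, feat_end)]


def predict_velocity(prev_feat, token_emb, w_feat, w_tok):
    """Stand-in lightweight model, conditioned on BOTH the previous features
    and an embedding of the tokens sampled at the previous step."""
    return [w_feat * f + w_tok * e for f, e in zip(prev_feat, token_emb)]


def shortcut_step(prev_feat, token_emb, dt, w_feat, w_tok):
    """Advance features by dt using the predicted average velocity,
    skipping a forward pass of the (expensive) base model."""
    v = predict_velocity(prev_feat, token_emb, w_feat, w_tok)
    return [f + dt * vi for f, vi in zip(prev_feat, v)]


def mse(pred, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    # Features recorded from two base-model steps (made-up numbers).
    f_start, f_end, dt = [0.0, 1.0], [1.0, 3.0], 0.5
    target = average_velocity_target(f_start, f_end, dt)

    # Hypothetical embedding of the sampled tokens and toy weights.
    tok = [1.0, 1.0]
    pred = predict_velocity(f_start, tok, w_feat=0.0, w_tok=2.0)

    print("target velocity:", target)
    print("training loss:", mse(pred, target))
    print("shortcut features:", shortcut_step(f_start, tok, dt, 0.0, 2.0))
```

In this sketch the sampled-token embedding is an explicit input to the predictor, which mirrors the abstract's point that caching-based approximations fail to account for sampling information. A real predictor would be a small learned network rather than two scalar weights.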

Architecture Design (Transformers, SSMs, MoE) · Computer Vision · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
