ByteDance | Fudan | HKU | Jun 8, 2026 | arXiv:2606.09156

OmniGen-AR: AutoRegressive Any-to-Image Generation

Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang, Zuxuan Wu, Yu-Gang Jiang

AI Summary

This paper introduces OmniGen-AR, a unified autoregressive framework for Any-to-Image generation that accepts diverse conditional inputs such as text, spatial signals, and visual context. It discretizes visual conditions with a shared visual tokenizer and uses Disentangled Causal Attention (DCA) to mitigate information leakage from condition tokens to content tokens. OmniGen-AR reports state-of-the-art or competitive results across a range of benchmarks, including 0.63 on GenEval and 80.02 on VBench, demonstrating flexible and high-fidelity visual generation.

Key Contribution

OmniGen-AR generates images from a wide range of conditions (text, spatial signals, and visual context) within a single model, whereas existing autoregressive methods are typically limited to single-modality conditions.

Abstract

Autoregressive (AR) models have demonstrated strong potential in visual generation, offering superior performance with simple architectures and optimization objectives. However, existing methods are typically limited to single-modality conditions, e.g., text, restricting their applicability in real-world scenarios that demand image synthesis from diverse controls. In this work, we present OmniGen-AR, a unified autoregressive framework for Any-to-Image generation. By discretizing various visual conditions through a shared visual tokenizer and text prompts with a text tokenizer, OmniGen-AR supports a broad spectrum of conditional inputs within a single model, including text (text-to-image generation), spatial signals (segmentation-to-image and depth-to-image), and visual context (image editing, frame prediction, and text-to-video generation). To mitigate the risk of information leakage from condition tokens to content tokens, we introduce Disentangled Causal Attention (DCA), which separates the full-sequence causal mask into condition causal attention and content causal attention. DCA serves as a training-time regularizer without affecting the standard next-token prediction during inference. With this design, OmniGen-AR achieves new state-of-the-art or at least competitive results across a range of benchmarks, e.g., 0.63 on GenEval and 80.02 on VBench, demonstrating its effectiveness in flexible and high-fidelity visual generation.
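The abstract describes DCA only at a high level, so the following is a minimal sketch of one possible reading, not the authors' implementation. It assumes a sequence laid out as condition tokens followed by content tokens, builds the standard full-sequence causal mask, and partitions out a condition-only causal mask and a content-only causal mask. The function names (`causal_mask`, `split_causal_mask`) and the layout are hypothetical, and how the paper combines the two masks during training is not stated in the abstract.

```python
from typing import List, Tuple

Mask = List[List[bool]]  # mask[i][j] is True when query i may attend to key j


def causal_mask(n: int) -> Mask:
    """Standard lower-triangular causal mask over n positions."""
    return [[j <= i for j in range(n)] for i in range(n)]


def split_causal_mask(num_cond: int, num_content: int) -> Tuple[Mask, Mask, Mask]:
    """Return (full, condition_only, content_only) causal masks.

    Assumption (not from the paper): positions [0, num_cond) hold condition
    tokens and the remaining positions hold content tokens.
    """
    n = num_cond + num_content
    full = causal_mask(n)
    is_cond = [i < num_cond for i in range(n)]

    # Condition causal attention: condition queries see earlier condition keys.
    cond = [[full[i][j] and is_cond[i] and is_cond[j] for j in range(n)]
            for i in range(n)]
    # Content causal attention: content queries see earlier content keys.
    content = [[full[i][j] and not is_cond[i] and not is_cond[j] for j in range(n)]
               for i in range(n)]
    return full, cond, content


def count_true(mask: Mask) -> int:
    """Number of allowed (query, key) pairs in a mask."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    full, cond, content = split_causal_mask(num_cond=2, num_content=3)
    for name, mask in (("full", full), ("condition", cond), ("content", content)):
        print(name)
        for row in mask:
            print("".join("1" if allowed else "." for allowed in row))
```

In this sketch, the pairs allowed by the full mask but by neither sub-mask are the content-to-condition connections, which is where leakage from condition tokens to content tokens would occur. Since the abstract states that inference uses standard next-token prediction, only the full causal mask would apply at generation time.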

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
