NC State | Nov 27, 2025 | arXiv:2511.22699

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Z-Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Fengyi Yu, Chi Zhang, Shifeng Zhang, Shilin Zhou

AI Summary

The authors introduce Z-Image, a 6B-parameter image generation foundation model based on a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture, designed to be efficient and accessible. They optimize the model lifecycle through data curation and training curriculum, achieving full training in 314K H800 GPU hours and developing Z-Image-Turbo with sub-second inference latency and consumer-grade hardware compatibility via few-step distillation and reward post-training. Z-Image demonstrates comparable or superior performance to larger models in photorealistic image generation and bilingual text rendering, while significantly reducing computational costs.

Key Contribution

You don't need 80B parameters to rival top-tier commercial image generators: Z-Image proves that a carefully optimized 6B model can deliver comparable performance with dramatically lower computational cost.

Abstract

The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle, from a curated data infrastructure to a streamlined training curriculum, we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models, thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.
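As a rough sanity check on the figures quoted in the abstract, the sketch below derives the implied cost per GPU hour and a weights-only memory estimate for a 6B-parameter model. The per-hour rate is derived here rather than stated by the authors, and the 2 bytes per parameter (16-bit weights) is an assumption for illustration. The estimate ignores activations, the text encoder, and the VAE.

```python
# Figures quoted in the abstract.
gpu_hours = 314_000        # H800 GPU hours for the full training workflow
total_cost_usd = 630_000   # approximate training cost quoted in the abstract

# Implied rental rate (derived here, not stated by the authors).
implied_rate = total_cost_usd / gpu_hours

# Weights-only memory for 6B parameters.
# ASSUMPTION: 16-bit weights (bf16/fp16), i.e. 2 bytes per parameter.
params = 6_000_000_000
bytes_per_param = 2
weights_gb = params * bytes_per_param / 1e9

print(f"Implied rate: {implied_rate:.2f} USD per GPU hour")
print(f"Weights only: {weights_gb:.0f} GB")
```

Under these assumptions the weights alone come to about 12 GB, which is consistent with the stated <16GB VRAM target but says nothing about peak memory during inference.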

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Open-Source Models & Weights

Citation Metrics

Citations: 25

Influential citations: 4

References: 0

Year: 2025

Venue: arXiv.org
