CAS | Corresponding authors | ECNU | Fudan | GIST Guangdong | HIT | IEEE | Kuaishou | National Technology Innovation Center | NJU | TAU | Tencent AI | USTC
Sep 28, 2025 | arXiv:2509.23951

HunyuanImage 3.0 Technical Report

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, Tiankai Hang, Duojun Huang, Jie Jiang, Zhengkai Jiang, Weijie Kong, Changlin Li, Donghao Li, Junzhe Li, Xin Li, Yang Li, Zhenxi Li, Zhimin Li, Jiaxin Lin, Linus, Lu-Hao Liu, Shu Liu, Songtao Liu, Yu Liu, Yuhong Liu, Yanxin Long, Fanbin Lu, Qinglin Lu, Yuyan Peng, Yuanbo Peng, Xiang-Yu Shen, Yi-Ping Shi, Jiale Tao, Yang-Dan Tao, Qianhui Tian, Pengfei Wan, Chunyu Wang, Kai Wang, Lei Wang, Linqing Wang, Lucas Wang, Qixun Wang, Weiyang Wang, Hao Wen, Bing Wu, Jianbing Wu, Yue Wu, Senhao Xie, Fangzhou Yang, Miles Yang, Xiaofeng Yang, Xuan Yang, Zhantao Yang, Jingmiao Yu, Zhengang Yuan, Chao Zhang, Jianwei Zhang, Pei-pei Zhang, Shiyuan Zhang, Tao Zhang, Weigang Zhang, Yepeng Zhang, Yingfang Zhang, Zihao Zhang, Zijian Zhang, Penghao Zhao, Zhiyuan Zhao, Xuefei Zhe, Jian-Xiang Zhu, Zhao Zhong

AI Summary

The authors introduce HunyuanImage 3.0, a native multimodal model that unifies understanding and generation within an autoregressive framework, with a focus on image generation. The work combines meticulous data curation, advanced architecture design, a native Chain-of-Thoughts schema, progressive pre-training, aggressive post-training, and efficient infrastructure for large-scale training and inference. The resulting Mixture-of-Experts (MoE) model has over 80 billion parameters in total, of which 13 billion are activated per token. Automatic and human evaluations of text-image alignment and visual quality indicate that it rivals previous state-of-the-art models.

Key Contribution

HunyuanImage 3.0 is presented as the largest open-source image generative model to date. It combines a Mixture-of-Experts architecture with a native Chain-of-Thoughts schema and rivals previous state-of-the-art models in text-image alignment and visual quality.

Abstract

We present HunyuanImage 3.0, a native multimodal model that unifies multimodal understanding and generation within an autoregressive framework, with its image generation module publicly available. The achievement of HunyuanImage 3.0 relies on several key components, including meticulous data curation, advanced architecture design, a native Chain-of-Thoughts schema, progressive model pre-training, aggressive model post-training, and an efficient infrastructure that enables large-scale training and inference. With these advancements, we successfully trained a Mixture-of-Experts (MoE) model comprising over 80 billion parameters in total, with 13 billion parameters activated per token during inference, making it the largest and most powerful open-source image generative model to date. We conducted extensive experiments and the results of automatic and human evaluation of text-image alignment and visual quality demonstrate that HunyuanImage 3.0 rivals previous state-of-the-art models. By releasing the code and weights of HunyuanImage 3.0, we aim to enable the community to explore new ideas with a state-of-the-art foundation model, fostering a dynamic and vibrant multimodal ecosystem. All open source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanImage-3.0
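The gap between total and activated parameters in the abstract follows from sparse expert routing: each token is sent to only a few experts, so most expert weights stay idle for that token. The sketch below illustrates generic top-k routing, not the report's actual router. The expert count, the value of k, and the function names `top_k_route` and `active_fraction` are hypothetical; only the 80 billion and 13 billion figures come from the abstract, and since the abstract says "over 80 billion", the computed fraction is approximate.

```python
import math
import random


def top_k_route(logits, k):
    """Pick the k experts with the largest router logits.

    Returns (expert_index, weight) pairs. The weights are a softmax
    taken over the selected logits only, so they sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def active_fraction(total_params, active_params):
    """Share of the model's parameters used for a single token."""
    return active_params / total_params


# Hypothetical router setup, chosen only for illustration.
NUM_EXPERTS = 8
TOP_K = 2

rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routes = top_k_route(router_logits, TOP_K)

# Parameter counts as stated in the abstract (80B is a lower bound there).
fraction = active_fraction(80e9, 13e9)
```

Under these figures, roughly one sixth of the parameters take part in processing any single token, which is why the per-token compute of such a model is closer to that of a much smaller dense model.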

Architecture Design (Transformers, SSMs, MoE) | Data Curation & Synthetic Data | Multimodal Models

Citation Metrics

Citations: 32

Influential citations: 7

References: 39

Year: 2025

Venue: arXiv.org
