The paper introduces Ming-Omni, a unified multimodal model that processes images, text, audio, and video through dedicated encoders feeding Ling, a Mixture-of-Experts (MoE) architecture equipped with modality-specific routers. This design allows Ming-Omni to perform both perception and generation tasks across modalities without task-specific fine-tuning. The model achieves strong performance in speech and image generation through the integration of an advanced audio decoder and Ming-Lite-Uni, and is released as an open-source model matching GPT-4o in modality support.
GPT-4o now has open-source competition: Ming-Omni matches its modality support in a single, unified model capable of perception and generation across image, text, audio, and video.
We propose Ming-Omni, a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-Omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-Omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results show that Ming-Omni offers a powerful solution for unified perception and generation across all modalities. Notably, our proposed Ming-Omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
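To make the routing idea concrete, the sketch below shows one way an MoE layer can share a single expert pool across modalities while dispatching tokens through a per-modality router, as the abstract describes for Ling. It is a minimal illustration, not the authors' implementation: the class name, expert/router sizes, and modality identifiers are assumptions introduced for this example.

```python
# Hypothetical sketch of an MoE layer with modality-specific routers.
# Names (ModalityMoELayer, num_experts, top_k, modality ids) are illustrative
# assumptions and do not reflect the actual Ling implementation.
import torch
import torch.nn as nn


class ModalityMoELayer(nn.Module):
    def __init__(self, d_model: int, num_experts: int = 8, top_k: int = 2,
                 modalities=("text", "image", "audio", "video")):
        super().__init__()
        self.top_k = top_k
        # Shared pool of expert FFNs used by tokens from every modality.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])
        # One lightweight gating network per modality, so tokens from
        # different encoders are dispatched with modality-specific routing.
        self.routers = nn.ModuleDict(
            {m: nn.Linear(d_model, num_experts) for m in modalities}
        )

    def forward(self, tokens: torch.Tensor, modality: str) -> torch.Tensor:
        # tokens: (num_tokens, d_model) produced by the encoder of `modality`.
        logits = self.routers[modality](tokens)               # (N, E)
        weights, idx = logits.softmax(-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(-1, keepdim=True)      # renormalize top-k
        out = torch.zeros_like(tokens)
        # Combine the top-k selected experts for each token.
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](tokens[mask])
        return out


# Example: image and audio tokens share the expert pool but use their own routers.
layer = ModalityMoELayer(d_model=64)
image_out = layer(torch.randn(16, 64), "image")
audio_out = layer(torch.randn(16, 64), "audio")
```

In this reading, the shared experts give the unified framework a common representation space, while the per-modality routers let each input type learn its own dispatch pattern without separate models or structural redesign.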