The paper introduces MOVA, an open-source Mixture-of-Experts (MoE) model with 32B parameters (18B active) designed for synchronized video and audio generation. MOVA addresses the limitations of cascaded pipelines and closed-source systems by enabling simultaneous generation of high-quality audio-visual content, including lip-synced speech and environment-aware sound effects. The model supports Image-Text to Video-Audio (IT2VA) generation and is released with code for efficient inference, LoRA fine-tuning, and prompt enhancement.
Open-source MOVA lets you generate synchronized, high-quality video and audio—including realistic lip sync—without relying on closed-source systems.
Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 demonstrate the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture with 32B total parameters, of which 18B are active during inference. It supports the Image-Text to Video-Audio (IT2VA) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.
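The abstract states a 32B-total / 18B-active parameter budget but does not describe the routing scheme. As a rough intuition for why an MoE model's active count is a subset of its total, here is a minimal top-k routed feed-forward layer in PyTorch; every name, dimension, and design choice below is a hypothetical toy, not MOVA's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMoELayer(nn.Module):
    """Minimal top-k Mixture-of-Experts feed-forward layer (illustrative only).

    Each token is routed to k of n_experts expert MLPs, so only a fraction
    of the layer's parameters participate in any forward pass -- the same
    reason MOVA can hold 32B parameters while activating 18B per inference.
    """

    def __init__(self, d_model: int = 64, d_ff: int = 256,
                 n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # per-token expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        logits = self.router(x)                      # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)   # keep the k best experts
        weights = F.softmax(weights, dim=-1)         # renormalize over chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e             # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out


layer = ToyMoELayer()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)  # torch.Size([10, 64])
```

With k=2 of 8 experts active, roughly a quarter of the expert parameters run per token; MOVA's 18B-of-32B ratio reflects the same sparsity principle at scale, though its exact expert count and routing are not given in this abstract.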