NUS, UofT | Feb 22, 2026 | arXiv:2602.19163

JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

Kai Liu, Yanhao Zheng, Kai Wang, Shengqiong Wu, Rongjunchen Zhang, Jiebo Luo, Dimitrios Hatzinakos, Ziwei Liu, Hao Fei, Tat-Seng Chua

AI Summary

The paper introduces JavisDiT++, a framework for joint audio-video generation (JAVG) designed to improve generation quality, temporal synchrony, and alignment with human preferences. JavisDiT++ uses a modality-specific mixture-of-experts (MS-MoE) design that supports cross-modal interaction while improving single-modal generation, along with a temporal-aligned RoPE (TA-RoPE) strategy for frame-level synchronization between audio and video tokens. The authors also introduce an audio-video direct preference optimization (AV-DPO) method to align model outputs with human preferences, and report state-of-the-art performance with only around 1M public training entries.
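The summary does not give the AV-DPO objective itself. As background only, the sketch below shows the standard DPO loss for a single preference pair, computed from log-likelihoods under the trained policy and a frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta` are hypothetical choices for illustration; how the paper adapts this idea to audio-video generation and to its quality, consistency, and synchrony dimensions is not shown here.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    All inputs are log-likelihoods of a whole sample. This is the generic
    formulation, not the paper's AV-DPO variant (an assumption of this sketch).
    """
    # How much more the policy favours each sample than the reference does.
    win_gain = logp_win - ref_logp_win
    lose_gain = logp_lose - ref_logp_lose
    margin = beta * (win_gain - lose_gain)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: no preference has been learned yet.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has shifted probability toward the preferred sample.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss falls as the policy raises the preferred sample's likelihood relative to the reference and lowers the rejected one's, which is the behaviour a preference-alignment stage relies on.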

Key Contribution

Achieves state-of-the-art joint audio-video generation with JavisDiT++ using only around 1M public training entries, rivaling the performance of models trained on proprietary datasets.

Abstract

AIGC has rapidly expanded from text-to-image generation toward high-quality multimodal synthesis across video and audio. Within this context, joint audio-video generation (JAVG) has emerged as a fundamental task that produces synchronized and semantically aligned sound and vision from textual descriptions. However, compared with advanced commercial models such as Veo3, existing open-source methods still suffer from limitations in generation quality, temporal synchrony, and alignment with human preferences. To bridge the gap, this paper presents JavisDiT++, a concise yet powerful framework for unified modeling and optimization of JAVG. First, we introduce a modality-specific mixture-of-experts (MS-MoE) design that enables effective cross-modal interaction while enhancing single-modal generation quality. Then, we propose a temporal-aligned RoPE (TA-RoPE) strategy to achieve explicit, frame-level synchronization between audio and video tokens. In addition, we develop an audio-video direct preference optimization (AV-DPO) method to align model outputs with human preference across quality, consistency, and synchrony dimensions. Built upon Wan2.1-1.3B-T2V, our model achieves state-of-the-art performance with only around 1M public training entries, significantly outperforming prior approaches in both qualitative and quantitative evaluations. Comprehensive ablation studies validate the effectiveness of the proposed modules. The code, model, and dataset are released at https://JavisVerse.github.io/JavisDiT2-page.
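The abstract does not spell out how TA-RoPE assigns positions. As a minimal sketch of the general idea only, the example below places audio and video tokens on one shared time axis and applies the standard rotary embedding, so tokens that start at the same moment receive the same rotation. The helpers `temporal_positions`, `rope_rotate`, and `dot`, the use of token start times, and the `ticks_per_second` parameter are all assumptions made for illustration, not the paper's formulation.

```python
import math


def temporal_positions(num_tokens, duration_s, ticks_per_second):
    """Position of each token's start time on a shared axis (hypothetical scheme)."""
    step_s = duration_s / num_tokens
    return [i * step_s * ticks_per_second for i in range(num_tokens)]


def rope_rotate(vec, pos, base=10000.0):
    """Standard RoPE: rotate each pair (vec[2k], vec[2k+1]) by pos * base**(-2k/d)."""
    d = len(vec)  # must be even
    out = []
    for k in range(d // 2):
        theta = pos * base ** (-2.0 * k / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * k], vec[2 * k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# A 2-second clip: 4 video tokens and 8 audio tokens along time (made-up sizes).
video_pos = temporal_positions(4, 2.0, ticks_per_second=2.0)
audio_pos = temporal_positions(8, 2.0, ticks_per_second=2.0)

# Video token 1 and audio token 2 both start at t = 0.5 s, so they share a position.
q = [0.3, -1.2, 0.7, 0.5]   # toy video query
k = [1.0, 0.4, -0.6, 0.9]   # toy audio key
aligned_score = dot(rope_rotate(q, video_pos[1]), rope_rotate(k, audio_pos[2]))
shifted_score = dot(rope_rotate(q, video_pos[1]), rope_rotate(k, audio_pos[5]))
```

Because RoPE attention scores depend only on the difference between positions, a query and key at the same temporal position score exactly as their unrotated dot product, while temporally offset pairs are rotated apart. That relative-offset property is what makes a shared time axis a plausible route to frame-level synchrony.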

Computer Vision, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 79

Year: 2026

Venue: N/A
