ZJU | Apr 21, 2026 | arXiv:2604.19679

MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation

Liyang Li, Wen Wang, Canyu Zhao, Tianjian Feng, Zhiyue Zhao

AI Summary

MMControl introduces a dual-stream conditional injection mechanism into Diffusion Transformers (DiTs) for joint audio-video generation, enabling multi-modal control via visual and acoustic signals. This approach injects conditions like reference images/audio, depth maps, and pose sequences through bypass branches in the DiT architecture. Modality-specific guidance scaling allows dynamic adjustment of each condition's influence, resulting in fine-grained control over identity, timbre, pose, and layout.
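The summary does not describe the internals of the bypass branches, so the following is only a minimal sketch of the general idea, not MMControl's implementation. It assumes a condition is encoded into a feature vector, passed through a separate (bypass) projection, and added as a residual to the main block's output. The names `linear`, `block_with_bypass`, and `strength` are hypothetical. The zero-initialized bypass weights in the usage note are an assumption borrowed from ControlNet-style designs.

```python
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product, standing in for a learned projection."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def block_with_bypass(
    hidden: Vector,
    main_weights: Matrix,
    cond: Optional[Vector] = None,
    bypass_weights: Optional[Matrix] = None,
    strength: float = 1.0,
) -> Vector:
    """Toy block: main path plus an optional residual from a bypass branch.

    Hypothetical illustration only; a real DiT block contains attention,
    normalization, and MLP layers that are omitted here.
    """
    out = linear(main_weights, hidden)
    if cond is not None and bypass_weights is not None:
        # The bypass branch sees only the condition features and
        # contributes an additive residual to the main stream.
        residual = linear(bypass_weights, cond)
        out = [o + strength * r for o, r in zip(out, residual)]
    return out
```

With all-zero bypass weights the block reproduces the unconditioned output exactly, which is why zero initialization is a common choice when attaching control branches to a pretrained backbone. Raising `strength` increases the influence of the condition on the main stream.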

Key Contribution

MMControl lets users steer both the video and the audio of generated characters, controlling identity, voice timbre, body pose, and scene layout within a single joint audio-video generation model.

Abstract

Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control. This restricts comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMControl introduces a dual-stream conditional injection mechanism. It incorporates both visual and acoustic control signals, including reference images, reference audio, depth maps, and pose sequences, into a joint generation process. These conditions are injected through bypass branches into a joint audio-video Diffusion Transformer, enabling the model to simultaneously generate identity-consistent video and timbre-consistent audio under structural constraints. Furthermore, we introduce modality-specific guidance scaling, which allows users to independently and dynamically adjust the influence strength of each visual and acoustic condition at inference time. Extensive experiments demonstrate that MMControl achieves fine-grained, composable control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.
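The abstract does not give the formula for modality-specific guidance scaling. One common way to realize per-condition control is an additive multi-condition variant of classifier-free guidance, where each condition contributes its own scaled difference from the unconditional prediction. The sketch below illustrates that variant under this assumption; the function `combine_guidance` and the condition names are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

Vector = List[float]


def combine_guidance(
    uncond: Vector,
    cond_preds: Dict[str, Vector],
    scales: Dict[str, float],
) -> Vector:
    """Additive multi-condition classifier-free guidance (assumed form).

    result = uncond + sum_c scales[c] * (cond_preds[c] - uncond)

    A condition with no entry in `scales` gets scale 0.0 and is ignored.
    """
    out = list(uncond)
    for name, pred in cond_preds.items():
        w = scales.get(name, 0.0)
        for i, (p, u) in enumerate(zip(pred, uncond)):
            out[i] += w * (p - u)
    return out


# Hypothetical example: stronger identity guidance, weaker timbre guidance.
guided = combine_guidance(
    uncond=[0.0, 0.0],
    cond_preds={"identity": [1.0, 0.0], "timbre": [0.0, 2.0]},
    scales={"identity": 2.0, "timbre": 0.5},
)
```

Because each scale multiplies only its own condition's difference term, a user can raise or lower one condition at inference time without touching the others, which matches the independent adjustment the abstract describes.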

Computer Vision, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
