Nankai University, NJUST, PKU, Tongyi Lab | Apr 28, 2026 | arXiv:2604.25819

Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

Yupeng Zhou, Lianghua Huang, Zhifan Wu, Jiabao Wang, Yupeng Shi, Biao Jiang, Daquan Zhou, Yu Liu, Ming-Ming Cheng, Qibin Hou

AI Summary

The paper introduces Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Uni-modal generators are trained first and then coupled into a unified audio-video model for joint training on paired data. Instead of distilling from a bidirectional teacher, the method integrates few-step and multi-step generation within a single weight-shared autoregressive model, enabling self-distillation and improved training-inference consistency. In experiments, Mutual Forcing matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps.

Key Contribution

Skip the bulky bidirectional teacher: this method trains a fast, causal audio-video generator directly, using only 4 to 8 sampling steps while matching or surpassing strong baselines that need around 50.

Abstract

In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model for joint training on paired data. For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, instead of following existing streaming distillation pipelines that typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages. Our answer is Mutual Forcing, which builds directly on a native autoregressive model and integrates few-step and multi-step generation within a single weight-shared model, enabling self-distillation and improved training-inference consistency. The multi-step mode improves the few-step mode via self-distillation, while the few-step mode generates historical context during training to improve training-inference consistency; because the two modes share parameters, these two effects reinforce each other within a single model. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, reduces training overhead, and allows the model to improve directly from real paired data rather than a fixed teacher. Experiments show that Mutual Forcing matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps, demonstrating substantial advantages in both efficiency and quality. The project page is available at https://mutualforcing.github.io.
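
To put the step counts quoted above in perspective, the short sketch below computes how many times fewer sampler evaluations a 4-step or 8-step generator needs than a baseline using around 50 steps. It is only back-of-the-envelope arithmetic. The function name `step_speedup` is a hypothetical helper introduced here, and the calculation assumes that every sampling step costs the same for both models. The abstract reports step counts, not wall-clock time, so actual speedups may differ.

```python
def step_speedup(baseline_steps: int, fast_steps: int) -> float:
    """Ratio of sampler evaluations, assuming equal cost per step (an assumption)."""
    if baseline_steps <= 0 or fast_steps <= 0:
        raise ValueError("step counts must be positive")
    return baseline_steps / fast_steps


# "Around 50" in the abstract is approximate; 50 is used here as a round figure.
BASELINE_STEPS = 50

for fast_steps in (4, 8):
    ratio = step_speedup(BASELINE_STEPS, fast_steps)
    print(f"{fast_steps} steps: {ratio:.2f}x fewer evaluations than {BASELINE_STEPS} steps")
```

Under these assumptions the reduction in evaluations is between 6.25x (8 steps) and 12.5x (4 steps).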

Computer Vision, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
