OmniForcing distills a high-quality offline audio-visual diffusion model into a real-time, autoregressive generator by addressing training instabilities arising from modality asymmetry and token sparsity. This is achieved through Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix, an Audio Sink Token mechanism with Identity RoPE, and Joint Self-Forcing Distillation. The resulting model achieves state-of-the-art streaming generation at ~25 FPS on a single GPU, matching the visual quality and multi-modal synchronization of the bidirectional teacher model.
Achieve real-time (25 FPS on a single GPU) audio-visual generation with quality comparable to offline diffusion models by distilling a bidirectional model into a streaming autoregressive generator.
Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information-density gap by introducing Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at ~25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher. Project Page: https://omniforcing.com
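The modality-independent rolling KV-cache mentioned in the abstract can be pictured as each modality maintaining its own bounded window of cached attention blocks, evicted on independent schedules. A minimal sketch of that idea, assuming a block-level cache; all class names, method names, and window sizes are illustrative, not taken from the paper:

```python
from collections import deque


class RollingKVCache:
    """Per-modality rolling key/value cache (illustrative sketch).

    Each modality keeps its own fixed-size window of cached (key, value)
    blocks, so sparse audio tokens and dense video tokens can be evicted
    on independent schedules during streaming generation.
    """

    def __init__(self, max_blocks_per_modality):
        # Window sizes per modality, e.g. {"video": 8, "audio": 2};
        # the specific numbers here are assumptions for illustration.
        self.caches = {
            m: deque(maxlen=n) for m, n in max_blocks_per_modality.items()
        }

    def append(self, modality, kv_block):
        # deque(maxlen=n) drops the oldest block automatically,
        # giving the rolling-window behavior.
        self.caches[modality].append(kv_block)

    def context(self, modality):
        # The cached blocks the next autoregressive block attends to.
        return list(self.caches[modality])


# Toy usage: video blocks arrive every step, audio blocks more sparsely.
cache = RollingKVCache({"video": 3, "audio": 1})
for step in range(5):
    cache.append("video", f"v{step}")
    if step % 2 == 0:
        cache.append("audio", f"a{step}")

print(cache.context("video"))  # last 3 video blocks: ['v2', 'v3', 'v4']
print(cache.context("audio"))  # last audio block: ['a4']
```

The point of keeping the windows independent is that the audio stream's far sparser tokens need not be evicted in lockstep with the dense video stream.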