OmniForcing distills a high-quality offline audio-visual diffusion model into a real-time, autoregressive generator by addressing training instabilities arising from modality asymmetry and token sparsity. This is achieved through Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix, an Audio Sink Token mechanism with Identity RoPE, and Joint Self-Forcing Distillation. The resulting model achieves state-of-the-art streaming generation at ~25 FPS on a single GPU, matching the visual quality and multi-modal synchronization of the bidirectional teacher model.
Achieve real-time (25 FPS on a single GPU) audio-visual generation with quality comparable to offline diffusion models by distilling a bidirectional model into a streaming autoregressive generator.
Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information-density gap by introducing Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at ~25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher. Project Page: https://omniforcing.com
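The modality-independent rolling KV-cache mentioned in the abstract can be pictured as each modality maintaining its own bounded window of cached attention blocks, evicted on independent schedules. A minimal sketch of that idea, assuming a block-level cache; all class names, method names, and window sizes are illustrative, not taken from the paper:

```python
from collections import deque


class RollingKVCache:
    """Per-modality rolling key/value cache (illustrative sketch).

    Each modality keeps its own fixed-size window of cached (key, value)
    blocks, so sparse audio tokens and dense video tokens can be evicted
    on independent schedules during streaming generation.
    """

    def __init__(self, max_blocks_per_modality):
        # Window sizes per modality, e.g. {"video": 8, "audio": 2};
        # the specific numbers here are assumptions for illustration.
        self.caches = {
            m: deque(maxlen=n) for m, n in max_blocks_per_modality.items()
        }

    def append(self, modality, kv_block):
        # deque(maxlen=n) drops the oldest block automatically,
        # giving the rolling-window behavior.
        self.caches[modality].append(kv_block)

    def context(self, modality):
        # The cached blocks the next autoregressive block attends to.
        return list(self.caches[modality])


# Toy usage: video blocks arrive every step, audio blocks more sparsely.
cache = RollingKVCache({"video": 3, "audio": 1})
for step in range(5):
    cache.append("video", f"v{step}")
    if step % 2 == 0:
        cache.append("audio", f"a{step}")

print(cache.context("video"))  # last 3 video blocks: ['v2', 'v3', 'v4']
print(cache.context("audio"))  # last audio block: ['a4']
```

The point of keeping the windows independent is that the audio stream's far sparser tokens need not be evicted in lockstep with the dense video stream.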