RUC | May 29, 2026 | arXiv:2605.31603

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Jiazheng Xing, Hangjie Yuan, Lingling Cai, Xinyu Liu, Yujie Wei, Fei Du, Hai Ci, Tao Feng, Jiasheng Tang, Weihua Chen, Fan Wang, Yong Liu

AI Summary

Lumos-Nexus is a two-stage framework for training-efficient unified video generation. During training, a lightweight generator is aligned with the understanding block; during inference, Unified Progressive Frequency Bridging (UPFB) progressively hands off generation to a high-capacity pretrained generator. UPFB operates in a shared latent space, enabling coarse-to-fine refinement and high-fidelity video generation without sacrificing reasoning quality. The paper also introduces VR-Bench to evaluate reasoning-driven video generation. Experiments report substantial gains in visual realism and temporal coherence on VBench and strong reasoning-based generative performance on VR-Bench.

Key Contribution

Achieve high-fidelity video generation without compromising reasoning by progressively handing off generation from a lightweight, reasoning-aligned generator to a high-capacity pretrained generator in a shared latent space.

Abstract

Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.
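The abstract does not specify how UPFB is implemented, so the following is only a toy illustration of the general idea of a frequency-split handoff, not the paper's algorithm. It assumes two 1-D "latents" of equal length: `coarse` stands in for the lightweight generator's output and `fine` for the high-capacity generator's output. The function names, the hard frequency cutoff, and the linear schedule are all hypothetical choices made for this sketch.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; returns the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def frequency_blend(coarse, fine, cutoff):
    """Take frequencies up to `cutoff` from `coarse` and the rest from `fine`.

    Hypothetical helper: a hard low-pass/high-pass split is assumed here
    purely for illustration.
    """
    n = len(coarse)
    if len(fine) != n:
        raise ValueError("both latents must have the same length")
    c_spec, f_spec = dft(coarse), dft(fine)
    blended = []
    for k in range(n):
        freq = min(k, n - k)  # bins k and n - k carry the same frequency
        blended.append(c_spec[k] if freq <= cutoff else f_spec[k])
    return idft(blended)


def cutoff_schedule(step, num_steps, max_freq):
    """Hypothetical linear schedule from max_freq (all coarse) to -1 (all fine)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    return max_freq - (step * (max_freq + 1)) // (num_steps - 1)


# Usage: over the steps, fewer and fewer frequency bands come from `coarse`.
coarse = [1.0] * 8            # constant signal: only the zero-frequency bin is set
fine = [1.0, -1.0] * 4        # alternating signal: only the highest bin is set
for step in range(6):
    cutoff = cutoff_schedule(step, 6, max_freq=4)
    latent = frequency_blend(coarse, fine, cutoff)
```

In this toy setting, early steps keep nearly all content from the lightweight stand-in, and later steps draw progressively more frequency bands from the high-capacity stand-in. An actual video model would work on multi-dimensional spatiotemporal latents inside a sampling loop, which this sketch does not attempt to model.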

Computer Vision | Multimodal Models | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
