The paper introduces DualDiff, a dual-branch conditional diffusion model for high-fidelity driving scene video generation that leverages Occupancy Ray-shape Sampling (ORS) as a conditional input for enhanced foreground and background control. To improve fine-grained foreground object synthesis, the authors propose a Foreground-Aware Mask (FGM) denoising loss and a Semantic Fusion Attention (SFA) mechanism for effective multimodal fusion. The method incorporates a Reward-Guided Diffusion (RGD) framework to maintain global consistency and semantic coherence in generated videos, achieving state-of-the-art performance on the NuScenes dataset and improving downstream tasks like BEV segmentation and 3D object detection.
Ditch the bounding boxes: DualDiff leverages Occupancy Ray-shape Sampling to generate driving scene videos with high fidelity, outperforming the best existing baselines in both generation quality (e.g., a 4.09% lower FID on NuScenes) and downstream task performance.
Accurate and high-fidelity driving scene reconstruction demands the effective utilization of comprehensive scene information as conditional inputs. Existing methods predominantly rely on 3D bounding boxes and BEV road maps for foreground and background control, which fail to capture the full complexity of driving scenes and adequately integrate multimodal information. In this work, we present DualDiff, a dual-branch conditional diffusion model designed to enhance driving scene generation across multiple views and video sequences. Specifically, we introduce Occupancy Ray-shape Sampling (ORS) as a conditional input, offering rich foreground and background semantics alongside 3D spatial geometry to precisely control the generation of both elements. To improve the synthesis of fine-grained foreground objects, particularly complex and distant ones, we propose a Foreground-Aware Mask (FGM) denoising loss function. Additionally, we develop the Semantic Fusion Attention (SFA) mechanism to dynamically prioritize relevant information and suppress noise, enabling more effective multimodal fusion. Finally, to ensure high-quality image-to-video generation, we introduce the Reward-Guided Diffusion (RGD) framework, which maintains global consistency and semantic coherence in generated videos. Extensive experiments demonstrate that DualDiff achieves state-of-the-art (SOTA) performance across multiple datasets. On the NuScenes dataset, DualDiff reduces the FID score by 4.09% compared to the best baseline. In downstream tasks, such as BEV segmentation, our method improves vehicle mIoU by 4.50% and road mIoU by 1.70%, while in BEV 3D object detection, the foreground mAP increases by 1.46%. Code will be made available at https://github.com/yangzhaojason/DualDiff.
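The Foreground-Aware Mask (FGM) denoising loss described above can be illustrated with a minimal sketch: the standard diffusion noise-prediction MSE is reweighted so that pixels under a foreground mask contribute more to the gradient, encouraging better synthesis of small or distant objects. The function name, the weighting scheme, and the `fg_weight` hyperparameter below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def fgm_denoising_loss(eps_pred, eps_true, fg_mask, fg_weight=2.0):
    """Hypothetical sketch of a foreground-aware masked denoising loss.

    eps_pred, eps_true: predicted and ground-truth diffusion noise tensors.
    fg_mask: binary mask, 1 on foreground pixels, 0 on background.
    fg_weight: assumed upweighting factor for foreground regions.
    """
    # Background pixels get weight 1, foreground pixels get fg_weight.
    weights = 1.0 + (fg_weight - 1.0) * fg_mask
    # Weighted mean-squared error over all elements.
    return float(np.mean(weights * (eps_pred - eps_true) ** 2))
```

With an all-zero mask this reduces to the ordinary denoising MSE; as `fg_weight` grows, foreground errors dominate the objective.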