MiLM Plus, SJTU, Xiaomi AI Lab | Apr 16, 2026 | arXiv:2604.15086

ControlFoley: Unified and Controllable Video-to-Audio Generation with Cross-Modal Conflict Handling

Jianxuan Yang, Xinyue Guo, Zhi Cheng, Kai Wang, Lipan Zhang, Jinjie Hu, Qiang Ji, Yihua Cao, Yihao Meng, Zhaoyue Cui, Mengmei Liu, Meng Meng, Jian Luan

AI Summary

ControlFoley is introduced as a unified video-to-audio (V2A) generation framework that provides precise control over video, text, and reference audio inputs. The framework uses a joint visual encoding paradigm integrating CLIP with a spatio-temporal audio-visual encoder and temporal-timbre decoupling to improve alignment and controllability. Results on the new VGGSound-TVC benchmark and other V2A tasks demonstrate state-of-the-art performance and superior controllability under cross-modal conflict compared to existing methods and an industrial V2A system.

Key Contribution

ControlFoley generates audio from video while letting text descriptions and reference audio steer the result, even when the text conflicts with the visual content.

Abstract

Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecise stylistic control due to entangled temporal and timbre information in reference audio. Moreover, the lack of standardized benchmarks limits systematic evaluation. We propose ControlFoley, a unified multimodal V2A framework that enables precise control over video, text, and reference audio. We introduce a joint visual encoding paradigm that integrates CLIP with a spatio-temporal audio-visual encoder to improve alignment and textual controllability. We further propose temporal-timbre decoupling to suppress redundant temporal cues while preserving discriminative timbre features. In addition, we design a modality-robust training scheme with unified multimodal representation alignment (REPA) and random modality dropout. We also present VGGSound-TVC, a benchmark for evaluating textual controllability under varying degrees of visual-text conflict. Extensive experiments demonstrate state-of-the-art performance across multiple V2A tasks, including text-guided, text-controlled, and audio-controlled generation. ControlFoley achieves superior controllability under cross-modal conflict while maintaining strong synchronization and audio quality, and shows competitive or better performance compared to an industrial V2A system. Code, models, datasets, and demos are available at: https://yjx-research.github.io/ControlFoley/.
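The abstract names random modality dropout as part of the training scheme but gives no details. The following is a minimal sketch of the general idea only, not the authors' implementation: the function name `drop_modalities`, the `drop_prob` parameter, the use of `None` as a placeholder for a dropped input, and the rule that at least one modality always survives are all assumptions made for illustration.

```python
import random


def drop_modalities(conditions, drop_prob=0.3, rng=None):
    """Randomly drop conditioning inputs, keeping at least one.

    `conditions` maps a modality name (e.g. "video", "text", "ref_audio")
    to its features. A dropped modality is replaced with None here; a real
    model would more likely substitute a learned "empty" embedding
    (assumption, not stated in the abstract).
    """
    if rng is None:
        rng = random.Random()

    # Drop each modality independently with probability `drop_prob`.
    kept = {
        name: (None if rng.random() < drop_prob else value)
        for name, value in conditions.items()
    }

    # Assumed safeguard: never train on a sample with no conditioning at all.
    if all(value is None for value in kept.values()):
        survivor = rng.choice(sorted(conditions))
        kept[survivor] = conditions[survivor]

    return kept


if __name__ == "__main__":
    sample = {"video": [0.1, 0.2], "text": "dog barking", "ref_audio": [0.5]}
    print(drop_modalities(sample, drop_prob=0.5, rng=random.Random(0)))
```

Training on samples where some inputs are missing is one common way to make a model usable with any subset of its conditions at inference time, which fits the text-guided, text-controlled, and audio-controlled settings listed in the abstract.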

Computer Vision, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
