ControlFoley is a unified video-to-audio (V2A) generation framework that enables precise control over video, text, and reference audio inputs. It pairs a joint visual encoding paradigm, which integrates CLIP with a spatio-temporal audio-visual encoder, with temporal-timbre decoupling of the reference audio to improve alignment and controllability. On the new VGGSound-TVC benchmark and other V2A tasks, it reports state-of-the-art performance and stronger controllability under cross-modal conflict than existing methods, and competitive or better performance than an industrial V2A system.
ControlFoley generates audio from video while letting text descriptions and reference audio steer the result, even when the text conflicts with the visual content.
Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecise stylistic control due to entangled temporal and timbre information in reference audio. Moreover, the lack of standardized benchmarks limits systematic evaluation. We propose ControlFoley, a unified multimodal V2A framework that enables precise control over video, text, and reference audio. We introduce a joint visual encoding paradigm that integrates CLIP with a spatio-temporal audio-visual encoder to improve alignment and textual controllability. We further propose temporal-timbre decoupling to suppress redundant temporal cues while preserving discriminative timbre features. In addition, we design a modality-robust training scheme with unified multimodal representation alignment (REPA) and random modality dropout. We also present VGGSound-TVC, a benchmark for evaluating textual controllability under varying degrees of visual-text conflict. Extensive experiments demonstrate state-of-the-art performance across multiple V2A tasks, including text-guided, text-controlled, and audio-controlled generation. ControlFoley achieves superior controllability under cross-modal conflict while maintaining strong synchronization and audio quality, and shows competitive or better performance compared to an industrial V2A system. Code, models, datasets, and demos are available at: https://yjx-research.github.io/ControlFoley/.
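The abstract names random modality dropout as part of the modality-robust training scheme but gives no details. As a rough illustration of the general idea only, the sketch below independently drops each conditioning input for a training sample so that a model would learn to cope with any subset of video, text, and reference audio. The function name `drop_modalities`, the `drop_prob` value, and the use of `None` as the "missing" marker are all assumptions for illustration, not the authors' implementation.

```python
import random


def drop_modalities(conditions, drop_prob=0.2, rng=None):
    """Return a copy of `conditions` with modalities randomly dropped.

    conditions: dict mapping a modality name (e.g. "video", "text",
        "ref_audio") to its embedding or raw input.
    drop_prob: hypothetical per-modality drop probability; the paper's
        actual values are not given in the abstract.
    rng: optional random.Random instance for reproducibility.
    """
    if rng is None:
        rng = random.Random()
    dropped = {}
    for name, value in conditions.items():
        # Each modality is dropped independently of the others.
        # None stands in for whatever "missing" representation a real
        # model would use (for example a learned null embedding).
        dropped[name] = None if rng.random() < drop_prob else value
    return dropped


if __name__ == "__main__":
    sample = {"video": [0.1, 0.2], "text": [0.3], "ref_audio": [0.4, 0.5]}
    print(drop_modalities(sample, drop_prob=0.5, rng=random.Random(0)))
```

In a real training loop a function like this would run once per sample (or per batch) before the conditioning inputs reach the generator. The input dictionary is left unmodified.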