Independent, Rochester, UTokyo | Jun 7, 2026 | arXiv:2606.08393

SMC-ITA: Sequential Monte Carlo Inference-Time Alignment for Video-to-Audio Generation

Haoyu Zhang, Yuta Oshima, Xingjian Du, Chunfeng Wang, Irene Li, Yusuke Iwasawa, Yutaka Matsuo

AI Summary

This paper studies inference-time alignment for flow-matching-based video-to-audio generation and formulates it as a search problem. The proposed method, Sequential Monte Carlo Inference-Time Alignment (SMC-ITA), combines lookahead-based reward estimation with sequential Monte Carlo resampling to reallocate computation adaptively using multi-dimensional cross-modal rewards. Relative to naive single-trajectory sampling, it reports a 55.67% relative reduction in DeSync and a 20.23% improvement in IB-score. Under matched NFE budgets, it achieves the best overall trade-off among the compared search baselines, outperforming Best-of-N and Beam Search.

Key Contribution

SMC-ITA achieves a 55.67% relative reduction in DeSync (audio-video desynchronization) over naive single-trajectory sampling in video-to-audio generation, using inference-time alignment alone.

Abstract

Video-to-audio (V2A) generation must jointly satisfy audiovisual alignment, semantic consistency, temporal synchronization, and perceptual quality. While prior work has mainly focused on model architecture, multimodal conditioning, and training objectives, inference-time alignment for V2A remains underexplored. In this paper, we study inference-time alignment for flow-matching-based V2A generation and formulate it as a search problem. We propose Sequential Monte Carlo Inference-Time Alignment (SMC-ITA), which combines lookahead-based reward estimation and sequential Monte Carlo resampling to reallocate computation adaptively using multi-dimensional cross-modal rewards. SMC-ITA improves over naive single-trajectory sampling, achieving a 55.67% relative reduction in DeSync, a 20.23% improvement in IB-score, and a 15.44% improvement in Audio Quality. Under matched NFE budgets, it also achieves the best overall trade-off among the compared search baselines, outperforming Best-of-N and Beam Search. Ablation studies further show that lookahead improves the reliability of intermediate reward estimates and that systematic resampling is a strong practical default for V2A inference-time alignment.
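The systematic resampling step named in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: the function names `reward_weights` and `systematic_resample`, the softmax-style mapping from rewards to weights, the `temperature` parameter, and the use of NumPy. It shows the standard textbook form of systematic resampling, not the authors' code.

```python
import numpy as np

def reward_weights(rewards, temperature=1.0):
    """Map per-particle rewards to normalized weights (assumed softmax form)."""
    r = np.asarray(rewards, dtype=float) / temperature
    w = np.exp(r - r.max())  # subtract the max for numerical stability
    return w / w.sum()

def systematic_resample(weights, rng):
    """Return particle indices drawn by standard systematic resampling."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    n = len(w)
    # One shared uniform offset gives n evenly spaced pointers in [0, 1).
    positions = (rng.random() + np.arange(n)) / n
    cumulative = np.cumsum(w)
    cumulative[-1] = 1.0  # guard against floating-point round-off
    # Each pointer selects the particle whose cumulative-weight bin contains it.
    return np.searchsorted(cumulative, positions, side="right")

# Hypothetical usage: four partial trajectories scored by a lookahead reward.
rng = np.random.default_rng(0)
weights = reward_weights([0.1, 0.9, 0.4, 0.2], temperature=0.5)
survivors = systematic_resample(weights, rng)  # indices of trajectories to continue
```

Because all pointers share a single random offset, a particle with normalized weight w is copied either floor(n·w) or ceil(n·w) times, which keeps resampling variance low compared with drawing each index independently.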

Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
