This paper addresses inference-time alignment in video-to-audio (V2A) generation by proposing Sequential Monte Carlo Inference-Time Alignment (SMC-ITA), which formulates the problem as a search task. By combining lookahead-based reward estimation with sequential Monte Carlo resampling, SMC-ITA adaptively reallocates computation using multi-dimensional cross-modal rewards. Compared with naive single-trajectory sampling, the method achieves a 55.67% relative reduction in DeSync and a 20.23% improvement in IB-score. Under matched NFE budgets, it also achieves the best overall trade-off among the compared search baselines, outperforming Best-of-N and Beam Search.
SMC-ITA achieves a 55.67% relative reduction in DeSync, the audio-video desynchronization metric, compared with naive single-trajectory sampling in video-to-audio generation.
Video-to-audio (V2A) generation must jointly satisfy audiovisual alignment, semantic consistency, temporal synchronization, and perceptual quality. While prior work has mainly focused on model architecture, multimodal conditioning, and training objectives, inference-time alignment for V2A remains underexplored. In this paper, we study inference-time alignment for flow-matching-based V2A generation and formulate it as a search problem. We propose Sequential Monte Carlo Inference-Time Alignment (SMC-ITA), which combines lookahead-based reward estimation and sequential Monte Carlo resampling to reallocate computation adaptively using multi-dimensional cross-modal rewards. SMC-ITA improves over naive single-trajectory sampling, achieving a 55.67% relative reduction in DeSync, a 20.23% improvement in IB-score, and a 15.44% improvement in Audio Quality. Under matched NFE budgets, it also achieves the best overall trade-off among the compared search baselines, outperforming Best-of-N and Beam Search. Ablation studies further show that lookahead improves the reliability of intermediate reward estimates and that systematic resampling is a strong practical default for V2A inference-time alignment.
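The abstract names systematic resampling as a practical default for the resampling step. The following is a minimal sketch of the standard systematic resampling scheme, not the paper's implementation. The helper `normalize_weights`, its softmax form, and the `temperature` parameter are assumptions made for illustration; the abstract does not say how rewards are mapped to particle weights.

```python
import math
import random
from typing import List, Optional, Sequence


def normalize_weights(rewards: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Map per-particle rewards to normalized weights with a softmax.

    Assumption: the softmax and `temperature` are illustrative choices,
    not taken from the paper.
    """
    m = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - m) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]


def systematic_resample(
    weights: Sequence[float], rng: Optional[random.Random] = None
) -> List[int]:
    """Return N ancestor indices using standard systematic resampling.

    A single uniform offset is drawn, then N evenly spaced points are
    swept across the cumulative weight distribution.
    """
    rng = rng or random.Random()
    n = len(weights)
    u0 = rng.random() / n  # one shared offset in [0, 1/n)
    indices: List[int] = []
    i = 0
    cumulative = weights[0]
    for k in range(n):
        u = u0 + k / n
        # Advance to the first particle whose cumulative weight covers u.
        while u > cumulative and i < n - 1:
            i += 1
            cumulative += weights[i]
        indices.append(i)
    return indices
```

In an SMC-style search, the returned indices would select which partial trajectories are copied forward, so that high-reward particles are duplicated and low-reward ones are dropped. Because only one random number is drawn, each particle is kept either floor(N·w) or ceil(N·w) times, which gives lower variance than independent multinomial draws.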