This paper introduces BEAT, a framework for automatic movie trailer generation that elastically aligns shots with music using a music-visual alignment encoder (MuVA) and an energy-adaptive dynamic programming algorithm (Bar-DP). MuVA is trained with Sinkhorn-regularized two-stage learning to produce compact cross-modal embeddings, while Bar-DP produces elastic many-to-one shot-music alignments that follow musical dynamics. Evaluated on the new TrailerArena benchmark, BEAT achieves state-of-the-art performance in shot selection, ordering, and perceptual quality while generating complete trailers end-to-end.
Forget rigid shot-music mappings: BEAT's elastic alignment framework finally captures the dynamic rhythm of professional movie trailer editing.
Automatic movie trailer generation must select shots from a full-length film and synchronize them with background music. Existing methods either relegate music alignment to post-processing or enforce rigid one-to-one shot-music mappings, overlooking that professional editing rhythm is elastic: rapid cuts accompany high-energy passages while sustained shots span quieter bars. We introduce BEAT, a framework that addresses this gap with two core components: MuVA, a compact music-visual alignment encoder trained with Sinkhorn-regularized two-stage learning, and Bar-DP, an energy-adaptive dynamic programming algorithm that produces elastic many-to-one alignments following musical dynamics. These components are integrated into a five-phase agentic pipeline that grounds the core alignment in learned cross-modal features while coordinating higher-level creative decisions through structured text signals. To support comprehensive evaluation, we also introduce TrailerArena, a benchmark with 20+ metrics across four complementary dimensions. On TrailerArena, BEAT achieves state-of-the-art performance across shot selection, ordering, and perceptual quality, while producing fully composed trailers end-to-end.
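To make the idea of elastic, energy-adaptive alignment concrete, the following is a minimal illustrative sketch and not the paper's Bar-DP, whose actual formulation is not given in the abstract. It assumes a precomputed shot-bar similarity matrix `sim` (standing in for scores from an encoder such as MuVA), per-bar energies in [0, 1], shots used in film order, and a hypothetical "hold cost" that penalizes stretching one shot across high-energy bars. The function name `elastic_align` and the parameters `max_span` and `lam` are assumptions introduced here for illustration.

```python
from typing import List, Tuple


def elastic_align(
    sim: List[List[float]],   # sim[s][b]: assumed similarity of shot s to bar b
    energy: List[float],      # energy[b]: assumed per-bar musical energy in [0, 1]
    max_span: int = 4,        # hypothetical cap on bars covered by one shot
    lam: float = 1.0,         # hypothetical weight of the hold cost
) -> List[Tuple[int, int, int]]:
    """Split the bars into contiguous segments and give each segment one shot.

    Returns (shot, first_bar, end_bar_exclusive) triples. Shots are used in
    increasing index order, a simplifying assumption of this sketch.
    """
    n_shots, n_bars = len(sim), len(energy)
    neg = float("-inf")
    # best[b][s]: best score covering bars [0, b) with the last segment on shot s
    best = [[neg] * n_shots for _ in range(n_bars + 1)]
    back = [[None] * n_shots for _ in range(n_bars + 1)]

    for b in range(1, n_bars + 1):
        for a in range(max(0, b - max_span), b):
            span = b - a
            # Holding a shot over several bars costs more when those bars are loud.
            hold_cost = lam * (span - 1) * sum(energy[a:b])
            for s in range(n_shots):
                seg = sum(sim[s][a:b]) - hold_cost
                if a == 0:
                    prev, prev_s = 0.0, None
                else:
                    prev, prev_s = neg, None
                    for p in range(s):  # predecessor must be an earlier shot
                        if best[a][p] > prev:
                            prev, prev_s = best[a][p], p
                if prev == neg:
                    continue
                if prev + seg > best[b][s]:
                    best[b][s] = prev + seg
                    back[b][s] = (a, prev_s)

    s = max(range(n_shots), key=lambda i: best[n_bars][i])
    if best[n_bars][s] == neg:
        raise ValueError("no feasible alignment under the given constraints")

    # Walk the back-pointers from the last bar to the first.
    segments = []
    b = n_bars
    while b > 0:
        a, prev_s = back[b][s]
        segments.append((s, a, b))
        b, s = a, prev_s
    return segments[::-1]


# Toy data: two quiet bars followed by two loud bars.
energy = [0.1, 0.1, 0.9, 0.9]
sim = [
    [0.9, 0.8, 0.1, 0.1],
    [0.1, 0.2, 0.9, 0.3],
    [0.1, 0.1, 0.2, 0.9],
]
print(elastic_align(sim, energy))  # [(0, 0, 2), (1, 2, 3), (2, 3, 4)]
```

In this toy example the first shot is held across both quiet bars, while each loud bar receives its own shot, mirroring the sustained-shot and rapid-cut behavior the abstract describes. A real system would also need to handle shot reordering and shot durations, which this sketch deliberately omits.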