Mar 18, 2026 · arXiv:2603.17651

Anchoring and Rescaling Attention for Semantically Coherent Inbetweening

Tae Eun Choi, Sumin Shim, Junhyeok Kim, Seong Jae Hwang

AI Summary

This paper introduces two techniques for generative inbetweening (GI): Keyframe-anchored Attention Bias, which gives each intermediate frame semantic and temporal guidance from the keyframes and text, and Rescaled Temporal RoPE, which improves frame consistency by letting self-attention attend to the keyframes more faithfully. On TGI-Bench, a new benchmark for text-conditioned GI introduced in the same paper, the method achieves state-of-the-art frame consistency, semantic fidelity, and pace stability without additional training.

Key Contribution

A training-free attention mechanism that anchors intermediate frames to the keyframes and text prompt, improving frame consistency and semantic alignment in generative inbetweening.

Abstract

Generative inbetweening (GI) seeks to synthesize realistic intermediate frames between the first and last keyframes beyond mere interpolation. As sequences become sparser and motions larger, previous GI models struggle, producing inconsistent frames with unstable pacing and semantic misalignment. Since GI involves fixed endpoints and numerous plausible paths, this task requires additional guidance gained from the keyframes and text to specify the intended path. Thus, we give semantic and temporal guidance from the keyframes and text to each intermediate frame through Keyframe-anchored Attention Bias. We also better enforce frame consistency with Rescaled Temporal RoPE, which allows self-attention to attend to keyframes more faithfully. TGI-Bench, the first benchmark specifically designed for text-conditioned GI evaluation, enables challenge-targeted evaluation to analyze GI models. Without additional training, our method achieves state-of-the-art frame consistency, semantic fidelity, and pace stability for both short and long sequences across diverse challenges.
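
The abstract names two mechanisms but this listing does not give their formulas. The sketch below is therefore only a generic illustration of the underlying building blocks, not the authors' method: an additive bias on attention scores that favors keyframe positions, and a rotary position embedding (RoPE) applied to temporal positions that have been multiplied by a constant factor. The function names, the bias value, and the constant-factor rescaling rule are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, bias=None):
    """Scaled dot-product attention weights for a single query.

    `bias` is an optional list with one additive term per key. Here it
    stands in for a keyframe-anchored bias (assumed form, not the paper's).
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    if bias is not None:
        scores = [s + b for s, b in zip(scores, bias)]
    return softmax(scores)


def rope_rotate(vec, position, base=10000.0):
    """Apply 1-D rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rescaled_positions(num_frames, scale):
    """Hypothetical rescaling rule: multiply each frame index by `scale`."""
    return [t * scale for t in range(num_frames)]


if __name__ == "__main__":
    # Five frames with identical keys; frames 0 and 4 play the role of keyframes.
    keys = [[1.0, 0.0]] * 5
    keyframe_bias = [2.0, 0.0, 0.0, 0.0, 2.0]  # illustrative value only
    print(attention_weights([1.0, 0.0], keys))
    print(attention_weights([1.0, 0.0], keys, keyframe_bias))

    # With scale < 1, the last frame sits at a smaller rotary position,
    # so its relative rotation to the first frame shrinks.
    print(rescaled_positions(5, 0.5))
```

With identical keys, the unbiased weights are uniform, while the biased call shifts probability mass toward the first and last positions. Because RoPE makes the query-key dot product depend only on the difference between positions, shrinking the positions shrinks the effective temporal distance that attention sees between a frame and the keyframes.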

Architecture Design (Transformers, SSMs, MoE) · Computer Vision · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
