Kuaishou | Feb 8, 2026 | arXiv:2602.07854

Geometry-Aware Rotary Position Embedding for Consistent Video World Model

Chendong Xiang, Jiajun Liu, Jintao Zhang, Xiao Yang, Zhengwei Fang, Shizun Wang, Zijun Wang, Yingtian Zou, Hang Su, Jun Zhu

AI Summary

The paper addresses geometric drift in video world models: the models fail to maintain stable scene structure over long trajectories, especially at loop closures, when the camera returns to a previously observed location. The authors introduce ViewRope, a geometry-aware rotary position embedding that encodes camera-ray directions into the video transformer's self-attention mechanism. Combined with Geometry-Aware Frame-Sparse Attention, it improves long-term consistency and reduces computational cost. The authors evaluate it with their newly proposed ViewBench diagnostic suite.

Key Contribution

Encodes camera-ray geometry directly into the self-attention mechanism, yielding video world models that stay more stable and consistent over long trajectories than models using screen-space positional embeddings.

Abstract

Predictive world models that simulate future observations under explicit camera control are fundamental to interactive AI. Despite rapid advances, current systems lack spatial persistence: they fail to maintain stable scene structures over long trajectories, frequently hallucinating details when cameras revisit previously observed locations. We identify that this geometric drift stems from reliance on screen-space positional embeddings, which conflict with the projective geometry required for 3D consistency. We introduce ViewRope, a geometry-aware encoding that injects camera-ray directions directly into video transformer self-attention layers. By parameterizing attention with relative ray geometry rather than pixel locality, ViewRope provides a model-native inductive bias for retrieving 3D-consistent content across temporal gaps. We further propose Geometry-Aware Frame-Sparse Attention, which exploits these geometric cues to selectively attend to relevant historical frames, improving efficiency without sacrificing memory consistency. We also present ViewBench, a diagnostic suite measuring loop-closure fidelity and geometric drift. Our results demonstrate that ViewRope substantially improves long-term consistency while reducing computational costs.
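The abstract gives no formulas, so the following is only an illustrative sketch of the general idea, not the paper's formulation. It assumes a pinhole camera with intrinsics `K` and a camera-to-world rotation `R_c2w`, converts each pixel's world-frame ray to azimuth and elevation angles, and uses those angles in place of pixel coordinates in a standard rotary rotation. All function names (`pixel_rays_world`, `ray_rope`, `select_frames`), the angle parameterization, the integer frequencies, and the forward-axis similarity used for frame selection are hypothetical choices made for this example.

```python
import numpy as np

def pixel_rays_world(K, R_c2w, height, width):
    """Unit ray directions in the world frame for each pixel center, shape (H*W, 3)."""
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays_cam = pix @ np.linalg.inv(K).T      # back-project pixels to camera-frame rays
    rays_world = rays_cam @ R_c2w.T          # rotate rays into the world frame
    return rays_world / np.linalg.norm(rays_world, axis=-1, keepdims=True)

def ray_angles(rays):
    """Azimuth (about the y axis) and elevation of unit rays, in radians."""
    az = np.arctan2(rays[:, 0], rays[:, 2])
    el = np.arcsin(np.clip(rays[:, 1], -1.0, 1.0))
    return az, el

def rotate_pairs(x, angle, freqs):
    """Standard rotary rotation of consecutive channel pairs by angle * freq."""
    theta = angle[:, None] * freqs[None, :]  # (N, C/2)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def ray_rope(x, rays, freqs):
    """Assumed scheme: first half of the channels follows azimuth, second half elevation.

    x has shape (N, D) with D divisible by 4; freqs has length D // 4.
    Integer freqs keep the rotation consistent where azimuth wraps at +/- pi.
    """
    az, el = ray_angles(rays)
    half = x.shape[1] // 2
    return np.concatenate(
        [rotate_pairs(x[:, :half], az, freqs), rotate_pairs(x[:, half:], el, freqs)],
        axis=1,
    )

def select_frames(query_forward, history_forwards, k):
    """Hypothetical frame-sparse selection: indices of the k past frames whose
    camera forward axis is most aligned with the query camera's forward axis."""
    sims = history_forwards @ query_forward
    return np.argsort(-sims)[: min(k, len(sims))]
```

Under these assumptions, the attention logit between two tokens depends only on the difference of their ray angles. Two tokens that observe the same world direction from different frames therefore receive the same logit as unrotated features would, regardless of where they fall on screen. A frame selector like `select_frames` would then restrict attention to the few past frames most likely to share content with the current view.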

Architecture Design (Transformers, SSMs, MoE); Computer Vision; Robotics & Embodied AI; World Models & Planning

Citation Metrics

Citations: 2

Influential citations: 0

References: 38

Year: 2026

Venue: N/A
