Tsinghua AI | Baidu | HKU | Mar 19, 2026 | arXiv:2603.19228

SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Hui Zhang, Haocheng Feng, Yu Lu, Yuqing Lu, Hang Zhou, Chun Yuan, Jingdong Wang

AI Summary

The paper introduces SAMA, a framework for instruction-guided video editing that factorizes the task into semantic anchoring and motion modeling to improve both semantic accuracy and motion fidelity. Semantic Anchoring predicts semantic tokens and video latents at sparse anchor frames for instruction-aware structural planning, while Motion Alignment pre-trains the backbone on motion-centric video restoration tasks to internalize temporal dynamics. Results show SAMA achieves state-of-the-art performance among open-source models and is competitive with commercial systems, even exhibiting strong zero-shot editing capabilities after factorized pre-training.

Key Contribution

Factorized pre-training on semantic anchoring and motion-centric video restoration tasks, performed before any fine-tuning on paired editing data, already yields strong zero-shot instruction-guided video editing.

Abstract

Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.
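The abstract names three motion-centric restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle) but does not define them here. The sketch below shows one plausible reading of each corruption on a toy video stored as nested lists `[T][H][W]`, with the uncorrupted video as the restoration target. All function names, parameters, and the exact corruption rules are assumptions made for illustration, not the authors' implementation.

```python
import random

def make_video(t, h, w):
    """Toy video as nested lists [T][H][W]; each pixel encodes its own (t, y, x)."""
    return [[[ti * 10000 + y * 100 + x for x in range(w)]
             for y in range(h)] for ti in range(t)]

def cube_inpaint_mask(video, t0, y0, x0, size, fill=-1):
    """Assumed 'cube inpainting': hide one spatio-temporal cube of the video."""
    out = [[row[:] for row in frame] for frame in video]  # copy, keep the target intact
    dt, dy, dx = size
    for t in range(t0, min(t0 + dt, len(out))):
        for y in range(y0, min(y0 + dy, len(out[0]))):
            for x in range(x0, min(x0 + dx, len(out[0][0]))):
                out[t][y][x] = fill
    return out

def speed_perturb(video, factor):
    """Assumed 'speed perturbation': resample frames at a constant speed factor.

    Uses the nearest earlier frame and clamps at the last frame.
    """
    n = len(video)
    return [video[min(int(i * factor), n - 1)] for i in range(n)]

def tube_shuffle(video, tube, rng):
    """Assumed 'tube shuffle': permute spatial tiles, with the same permutation
    applied to every frame so that each tile forms a tube through time."""
    h, w = len(video[0]), len(video[0][0])
    assert h % tube == 0 and w % tube == 0, "tile size must divide H and W"
    cells = [(y, x) for y in range(0, h, tube) for x in range(0, w, tube)]
    perm = cells[:]
    rng.shuffle(perm)
    out = [[row[:] for row in frame] for frame in video]
    for (dy, dx), (sy, sx) in zip(cells, perm):
        for t in range(len(video)):
            for i in range(tube):
                for j in range(tube):
                    out[t][dy + i][dx + j] = video[t][sy + i][sx + j]
    return out, perm

video = make_video(4, 4, 4)
masked = cube_inpaint_mask(video, t0=1, y0=0, x0=0, size=(2, 2, 2))
fast = speed_perturb(video, factor=2.0)
shuffled, perm = tube_shuffle(video, tube=2, rng=random.Random(0))
```

In a training loop, each corrupted clip would be paired with the original `video` as the target, so the model has to recover missing content, timing, or spatial layout from temporal context. How SAMA actually samples and combines these corruptions is not stated in the abstract.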

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 77

Year: 2026

Venue: N/A
