Adobe Research, Jilin, SJTU, University of California, University of California at Merced
May 21, 2026
arXiv:2605.22818

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

Lee Hsin-Ying, Hanwen Jiang, Yiqun Mei, Ming-Hsuan Yang, Zhixin Shu

AI Summary

Current motion-controlled video generation models rigidly follow user-provided trajectories, even when those trajectories are sparse or imprecise. MotiMotion instead treats motion control as a reasoning-then-generation problem. A training-free vision-language reasoner refines the primary trajectories and hallucinates plausible secondary motions, which improves causal consistency. A confidence-aware control scheme then modulates guidance strength to improve motion naturalness. On a new benchmark, MotiBench, VLM-based evaluation and a human study show more plausible object behaviors and interactions than existing approaches.

Key Contribution

Motion-controlled video generation can now produce more plausible and natural results by reasoning about motion and its consequences, rather than rigidly following user-defined trajectories.

Abstract

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative priors. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes where new events are triggered by motion. Both VLM-based evaluation and a human study on MotiBench demonstrate that MotiMotion produces videos with more plausible object behaviors and interaction, and is preferred over existing approaches.
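
The confidence-aware control idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how confidence maps to guidance strength, so everything here is an assumption made for illustration: the linear mapping, the names `PlannedTrajectory` and `guidance_strength`, and the default strength bounds are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PlannedTrajectory:
    """Hypothetical container for one trajectory proposed by a reasoner."""
    name: str
    points: list[tuple[float, float]]  # image-space (x, y) coordinates
    confidence: float                  # assumed to lie in [0, 1]


def guidance_strength(confidence: float,
                      min_strength: float = 0.2,
                      max_strength: float = 1.0) -> float:
    """Linearly interpolate guidance strength from a confidence score.

    Assumption: high confidence means the generator should follow the plan
    closely (strength near max_strength); low confidence leaves more room
    for the generator's own priors (strength near min_strength).
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return (1.0 - confidence) * min_strength + confidence * max_strength


if __name__ == "__main__":
    # Illustrative values only: a primary motion and an inferred secondary one.
    plans = [
        PlannedTrajectory("primary: ball rolls right",
                          [(10.0, 50.0), (60.0, 50.0)], confidence=0.9),
        PlannedTrajectory("secondary: cup tips over",
                          [(70.0, 45.0), (75.0, 55.0)], confidence=0.3),
    ]
    for plan in plans:
        strength = guidance_strength(plan.confidence)
        print(f"{plan.name}: guidance strength {strength:.2f}")
```

In this sketch the secondary motion, which the reasoner is less sure about, receives weaker guidance than the primary one. The actual scheme in the paper may use a different mapping or apply the modulation at a different stage of generation.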

Computer Vision, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
