Technion · Nov 9, 2025 · arXiv:2511.08633

Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising

Assaf Singer, Noam Rotstein, Amir Mann, Ron Kimmel, O. Litany

AI Summary

The paper introduces Time-to-Move (TTM), a training-free framework for motion- and appearance-controlled video generation using image-to-video (I2V) diffusion models. TTM takes crude reference animations as motion cues, adapting SDEdit's use of coarse cues to the video domain, and preserves appearance through image conditioning. Its core is dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions while leaving the rest flexible. The authors report that TTM matches or exceeds training-based baselines in realism and motion control, with no additional training or runtime cost.

Key Contribution

A training-free "dual-clock denoising" sampling strategy that controls video generation from crude user-provided motion cues: it enforces motion alignment in the regions the user specifies and lets the diffusion model fill in the rest.

Abstract

Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. Prior methods for motion-conditioned synthesis typically require model-specific fine-tuning, which is computationally expensive and restrictive. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Our key insight is to use crude reference animations obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection. Motivated by SDEdit's use of coarse layout cues for image editing, we treat the crude animations as coarse motion cues and adapt the mechanism to the video domain. We preserve appearance with image conditioning and introduce dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions while allowing flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight modification of the sampling process incurs no additional training or runtime cost and is compatible with any backbone. Extensive experiments on object and camera motion benchmarks show that TTM matches or exceeds existing training-based baselines in realism and motion control. Beyond this, TTM introduces a unique capability: precise appearance control through pixel-level conditioning, exceeding the limits of text-only prompting. Visit our project page for video examples and code: https://time-to-move.github.io/.
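The region-dependent idea in the abstract can be illustrated with a toy sketch. This is one plausible reading under stated assumptions, not the authors' implementation: the abstract does not give the algorithm, so the noising rule `forward_noise`, the override schedule controlled by `t_strong`, and the `denoise_step` callable (a stand-in for an I2V diffusion model) are all hypothetical names and choices introduced here. The "video" is flattened to a list of floats to keep the example dependency-free.

```python
import math
import random


def forward_noise(frame, t, rng):
    """Toy variance-preserving noising: t=0 returns the input, t=1 is pure noise."""
    return [math.sqrt(1.0 - t) * v + math.sqrt(t) * rng.gauss(0.0, 1.0) for v in frame]


def dual_clock_sample(reference, mask, denoise_step, timesteps, t_strong, rng):
    """Region-dependent SDEdit-style sampling (illustrative assumption only).

    reference    : crude animation, flattened to a list of floats
    mask         : True where motion is user-specified
    denoise_step : hypothetical callable (x, t, t_next) -> x standing in for the model
    timesteps    : descending noise levels; timesteps[0] starts the background clock
    t_strong     : noise level down to which masked pixels stay tied to the reference
    """
    # Background clock: everything starts from the reference at the highest noise level.
    x = forward_noise(reference, timesteps[0], rng)
    for t, t_next in zip(timesteps, timesteps[1:]):
        x = denoise_step(x, t, t_next)
        if t_next >= t_strong:
            # Masked clock has not started yet: overwrite masked pixels with the
            # reference noised to the current level; unmasked pixels stay free.
            anchored = forward_noise(reference, t_next, rng)
            x = [a if m else v for a, v, m in zip(anchored, x, mask)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    identity = lambda x, t, t_next: x  # placeholder; a real model would denoise here
    out = dual_clock_sample([1.0] * 4, [True, True, False, False], identity,
                            [0.9, 0.6, 0.3, 0.0], t_strong=0.6, rng=rng)
    print(out)
```

In this sketch the masked region is effectively denoised from a lower noise level than the background, so it stays closer to the crude animation, while the unmasked region starts from heavier noise and is left to the model. In a real pipeline `denoise_step` would wrap the I2V model's sampler update, and the two noise levels would be tuned per backbone.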

Computer Vision · Multimodal Models

Citation Metrics

Citations: 1

Influential citations: 0

References: 0

Year: 2025

Venue: arXiv.org
