TokenDial introduces a method for continuous attribute control in text-to-video generation by applying learned, attribute-specific offsets in the spatiotemporal visual patch-token space of a pretrained model. These offsets, trained using semantic direction matching and motion-magnitude scaling, allow for fine-grained adjustments to appearance and motion dynamics without retraining the entire backbone. The approach achieves superior controllability and edit quality compared to existing methods, as validated through quantitative metrics and human evaluations.
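As a rough illustration of the mechanism described above, the sketch below adds a learned, attribute-specific offset to intermediate spatiotemporal patch tokens and scales it by a continuous slider value. The tensor shapes, variable names, and the use of PyTorch are illustrative assumptions, not details taken from the paper.

```python
import torch

def apply_token_offset(patch_tokens: torch.Tensor,
                       offset: torch.Tensor,
                       alpha: float) -> torch.Tensor:
    """Add a learned attribute-specific offset to intermediate
    spatiotemporal patch tokens, scaled by a slider value alpha.

    patch_tokens: (batch, frames * patches, dim) activations from a
                  frozen text-to-video backbone (illustrative shape).
    offset:       (dim,) learned control direction for one attribute.
    alpha:        continuous knob; 0 leaves the video unchanged,
                  larger |alpha| strengthens the edit.
    """
    return patch_tokens + alpha * offset


# Example: nudge tokens along a hypothetical "motion magnitude" direction.
tokens = torch.randn(1, 16 * 256, 1024)   # dummy activations
motion_dir = torch.randn(1024) * 0.01     # stand-in for a learned offset
edited = apply_token_offset(tokens, motion_dir, alpha=1.5)
```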
Unlock slider-style control over video attributes like motion magnitude and effect intensity without retraining your text-to-video model – just nudge the spatiotemporal tokens.
We present TokenDial, a framework for continuous, slider-style attribute control in pretrained text-to-video generation models. While modern generators produce high-quality videos overall, they offer limited control over how much an attribute changes (e.g., effect intensity or motion magnitude) without drift in identity, background, or temporal coherence. TokenDial is built on the observation that additive offsets in the intermediate spatiotemporal visual patch-token space form a semantic control direction: adjusting the offset magnitude yields coherent, predictable edits to both appearance and motion dynamics. We learn attribute-specific token offsets without retraining the backbone, using pretrained understanding signals: semantic direction matching for appearance and motion-magnitude scaling for motion. We demonstrate TokenDial's effectiveness across diverse attributes and prompts, achieving stronger controllability and higher-quality edits than state-of-the-art baselines, as supported by extensive quantitative evaluation and human studies.
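The abstract names two training signals, semantic direction matching for appearance and motion-magnitude scaling for motion, but does not spell out the objectives. The sketch below is one plausible, hypothetical formulation under those names; the choice of embedding model, the motion-magnitude estimate, and all function and argument names are assumptions rather than the paper's actual losses.

```python
import torch
import torch.nn.functional as F

def direction_matching_loss(emb_edited: torch.Tensor,
                            emb_base: torch.Tensor,
                            target_direction: torch.Tensor) -> torch.Tensor:
    """Illustrative appearance objective: encourage the shift that the
    token offset induces in a pretrained embedding of the generated
    frames to align with a target semantic direction (e.g., derived
    from a prompt pair). All tensors are (batch, dim)."""
    delta = emb_edited - emb_base
    return 1.0 - F.cosine_similarity(delta, target_direction, dim=-1).mean()


def motion_scaling_loss(motion_edited: torch.Tensor,
                        motion_base: torch.Tensor,
                        alpha: torch.Tensor) -> torch.Tensor:
    """Illustrative motion objective: ask a scalar motion-magnitude
    estimate per video (e.g., mean optical-flow norm), shape (batch,),
    to scale with the slider value alpha relative to the unedited
    generation."""
    return F.mse_loss(motion_edited, alpha * motion_base)
```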