MIT CSAIL, Nankai University, NJU, PKU, ZJU

Apr 21, 2026

arXiv:2604.19473

TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation

Hongyu Zhang, Yufan Deng, Zilin Pan, Qibin Hou, Zhiyang Dou, Daquan Zhou

AI Summary

The paper introduces Temporal-wise Separable Attention (TS-Attn), a training-free attention mechanism for generating multi-event videos from complex temporal descriptions. TS-Attn targets two problems, temporal misalignment between video content and the prompt and conflicting attention coupling between motion-related visual objects and their text conditions, by dynamically rearranging the attention distribution to ensure temporal awareness and global coherence. Integrated into pre-trained text-to-video models, TS-Attn improves StoryEval-Bench scores by 33.5% on Wan2.1-T2V-14B and 16.4% on Wan2.2-T2V-A14B with only a 2% increase in inference time.

Key Contribution

TS-Attn, a training-free attention mechanism that dynamically aligns video content with complex temporal prompts, raises StoryEval-Bench scores for multi-event video generation by 33.5% on Wan2.1-T2V-14B and 16.4% on Wan2.2-T2V-A14B.

Abstract

Generating high-quality videos from complex temporal descriptions that contain multiple sequential actions is a key unsolved problem. Existing methods are constrained by an inherent trade-off: using multiple short prompts fed sequentially into the model improves action fidelity but compromises temporal consistency, while a single complex prompt preserves consistency at the cost of prompt-following capability. We attribute this problem to two primary causes: 1) temporal misalignment between video content and the prompt, and 2) conflicting attention coupling between motion-related visual objects and their associated text conditions. To address these challenges, we propose a novel, training-free attention mechanism, Temporal-wise Separable Attention (TS-Attn), which dynamically rearranges attention distribution to ensure temporal awareness and global coherence in multi-event scenarios. TS-Attn can be seamlessly integrated into various pre-trained text-to-video models, boosting StoryEval-Bench scores by 33.5% and 16.4% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B with only a 2% increase in inference time. It also supports plug-and-play usage across models for multi-event image-to-video generation. The source code and project page are available at https://github.com/Hong-yu-Zhang/TS-Attn.
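The abstract does not specify how TS-Attn rearranges the attention distribution, so the snippet below is not the authors' algorithm. It is a toy sketch of the general idea of making text-to-video cross-attention temporally aware: each frame is assigned to one event, and the attention logits for text tokens of that event are raised while those of other events are lowered. The equal-length frame split, the `bias` value, and all function and variable names are assumptions made purely for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def segment_of_frame(frame_idx, num_frames, num_events):
    """Map a frame to an event by splitting the clip into equal spans (assumption)."""
    return min(frame_idx * num_events // num_frames, num_events - 1)


def temporally_biased_attention(logits, token_event, num_events, bias=2.0):
    """Reweight cross-attention so each frame favors its own event's tokens.

    logits[f][t]   : raw attention logit of frame f for text token t
    token_event[t] : event index of token t, or None for shared tokens
                     (for example the subject, which appears in every event)
    bias           : hypothetical strength of the temporal preference
    """
    num_frames = len(logits)
    weights = []
    for f, row in enumerate(logits):
        event = segment_of_frame(f, num_frames, num_events)
        adjusted = []
        for t, logit in enumerate(row):
            if token_event[t] is None:
                adjusted.append(logit)         # shared token: leave untouched
            elif token_event[t] == event:
                adjusted.append(logit + bias)  # token of the frame's own event
            else:
                adjusted.append(logit - bias)  # token of a different event
        weights.append(softmax(adjusted))
    return weights


# Toy prompt with tokens "cat" (shared), "jumps" (event 0), "sleeps" (event 1),
# and a 4-frame clip whose raw logits are all equal.
token_event = [None, 0, 1]
logits = [[0.0, 0.0, 0.0] for _ in range(4)]
weights = temporally_biased_attention(logits, token_event, num_events=2)
```

In this toy setup the first two frames put most of their attention on "jumps" and the last two on "sleeps", while the shared subject token keeps a nonzero weight throughout. A real integration would operate on the cross-attention tensors of a video diffusion model rather than on Python lists.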

Architecture Design (Transformers, SSMs, MoE), Computer Vision Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
