Curtin | Mar 9, 2026 | arXiv:2603.08028

Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades

Ashkan Taghipour, Morteza Ghahremani, Zinuo Li, Hamid Laga, Farid Boussaid, Mohammed Bennamoun

AI Summary

This paper introduces a two-stage cascaded framework for generating complex human motion videos from text descriptions. The framework first uses an autoregressive text-to-skeleton model to generate 2D pose sequences, and then employs a pose-conditioned video diffusion model with a novel adaptive layer fusion (DINO-ALF) mechanism to synthesize videos. To facilitate research in this area, the authors also contribute a new Blender-based synthetic dataset of acrobatic motions.

Key Contribution

Users no longer need to supply complete skeleton sequences: the approach generates acrobatic human motion videos from a text description and a reference image by cascading a text-to-skeleton model with a pose-conditioned video diffusion model.

Abstract

Generating videos of complex human motions such as flips, cartwheels, and martial arts remains challenging for current video diffusion models. Text-only conditioning is temporally ambiguous for fine-grained motion control, while explicit pose-based controls, though effective, require users to provide complete skeleton sequences that are costly to produce for long and dynamic actions. We propose a two-stage cascaded framework that addresses both limitations. First, an autoregressive text-to-skeleton model generates 2D pose sequences from natural language descriptions by predicting each joint conditioned on previously generated poses. This design captures long-range temporal dependencies and inter-joint coordination required for complex motions. Second, a pose-conditioned video diffusion model synthesizes videos from a reference image and the generated skeleton sequence. It employs DINO-ALF (Adaptive Layer Fusion), a multi-level reference encoder that preserves appearance and clothing details under large pose changes and self-occlusions. To address the lack of publicly available datasets for complex human motion video generation, we introduce a Blender-based synthetic dataset containing 2,000 videos with diverse characters performing acrobatic and stunt-like motions. The dataset provides full control over appearance, motion, and environment. It fills an important gap because existing benchmarks significantly under-represent acrobatic motions while web-collected datasets raise copyright and privacy concerns. Experiments on our synthetic dataset and the Motion-X Fitness benchmark show that our text-to-skeleton model outperforms prior methods on FID, R-precision, and motion diversity. Our pose-to-video model also achieves the best results among all compared methods on VBench metrics for temporal consistency, motion smoothness, and subject preservation.
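The control flow of the two-stage cascade described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`generate_skeleton`, `synthesize_video`, `toy_predict_joint`, `toy_render_frame`) and the data layout are hypothetical, the learned models are replaced by toy stand-ins, and conditioning each joint on the joints already predicted in the same frame is an assumption about how "predicting each joint" is realized. The real second stage is a video diffusion model with the DINO-ALF reference encoder, which this sketch does not attempt to reproduce.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Joint = Tuple[float, float]   # (x, y) in normalized image coordinates
Pose = List[Joint]            # one frame: a list of 2D joints


def generate_skeleton(
    text: str,
    num_frames: int,
    num_joints: int,
    predict_joint: Callable[[str, Sequence[Pose], Pose], Joint],
) -> List[Pose]:
    """Stage 1 (sketch): autoregressive, joint-by-joint pose generation."""
    poses: List[Pose] = []
    for _ in range(num_frames):
        current: Pose = []
        for _ in range(num_joints):
            # Each joint is predicted from the text, all previously generated
            # frames, and (assumption) the joints already emitted in this frame.
            current.append(predict_joint(text, poses, current))
        poses.append(current)
    return poses


def synthesize_video(
    reference_image: str,
    poses: Sequence[Pose],
    render_frame: Callable[[str, Pose], Dict],
) -> List[Dict]:
    """Stage 2 (sketch): one output frame per generated pose, conditioned on
    the reference image. A real diffusion model would denoise frames jointly."""
    return [render_frame(reference_image, pose) for pose in poses]


def toy_predict_joint(text: str, history: Sequence[Pose], partial: Pose) -> Joint:
    """Hypothetical stand-in for a learned model: drift right from the last frame."""
    j = len(partial)  # index of the joint being predicted
    if not history:
        return (0.5, 0.1 * j)
    prev_x, prev_y = history[-1][j]
    return (prev_x + 0.01, prev_y)


def toy_render_frame(reference_image: str, pose: Pose) -> Dict:
    """Hypothetical stand-in for the pose-conditioned video model."""
    return {"appearance": reference_image, "pose": pose}


if __name__ == "__main__":
    skeleton = generate_skeleton("a cartwheel", 4, 3, toy_predict_joint)
    video = synthesize_video("reference.png", skeleton, toy_render_frame)
    print(len(video), "frames,", len(skeleton[0]), "joints per frame")
```

The point of the split is that the user only supplies text and a reference image; the skeleton sequence that pose-conditioned models normally require as input is produced by the first stage and passed to the second unchanged.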

Computer Vision | Multimodal Models | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 78

Year: 2026

Venue: N/A
