ByteDance | Institute of Artificial Intelligence (TeleAI) | SJTU | Apr 15, 2026 | arXiv:2604.13427

A Unified Conditional Flow for Motion Generation, Editing, and Intra-Structural Retargeting

Junlin Li, Xinhao Song, Siqi Wang, Haibin Huang, Yili Zhao

AI Summary

This paper introduces a unified conditional flow model for text-driven motion generation, editing, and intra-structural retargeting by framing these tasks as conditional transport problems. The model combines rectified flow matching with a DiT-style transformer that uses per-joint tokenization and explicit joint self-attention to enforce kinematic dependencies. Experiments on SnapMoGen and Mixamo show that a single model can perform all three tasks with improved structural consistency compared to task-specific baselines.

Key Contribution

Motion editing and retargeting are two sides of the same generative coin, solvable with a single conditional flow model.

Abstract

Text-driven motion editing and intra-structural retargeting, where source and target share topology but may differ in bone lengths, are traditionally handled by fragmented pipelines with incompatible inputs and representations: editing relies on specialized generative steering, while retargeting is deferred to geometric post-processing. We present a unifying perspective where both tasks are cast as instances of conditional transport within a single generative framework. By leveraging recent advances in flow matching, we demonstrate that editing and retargeting are fundamentally the same generative task, distinguished only by which conditioning signal, semantic or structural, is modulated during inference. We implement this vision via a rectified-flow motion model jointly conditioned on text prompts and target skeletal structures. Our architecture extends a DiT-style transformer with per-joint tokenization and explicit joint self-attention to strictly enforce kinematic dependencies, while a multi-condition classifier-free guidance strategy balances text adherence with skeletal conformity. Experiments on SnapMoGen and a multi-character Mixamo subset show that a single trained model supports text-to-motion generation, zero-shot editing, and zero-shot intra-structural retargeting. This unified approach simplifies deployment and improves structural consistency compared to task-specific baselines.
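To make the inference-time idea concrete, here is a minimal sketch of multi-condition classifier-free guidance combined with Euler integration of a rectified-flow velocity field. It is an illustration under stated assumptions, not the paper's implementation: the additive guidance form, the weights `w_text` and `w_skel`, the function names, and the toy velocity field are all hypothetical, since the abstract does not specify them.

```python
import numpy as np

def guided_velocity(v_fn, x, t, text, skel, w_text=2.0, w_skel=1.5):
    """Blend velocity predictions from separately dropped conditions.

    Assumed (not from the paper) compositional guidance form:
        v = v_uncond + w_text * (v_text - v_uncond) + w_skel * (v_skel - v_uncond)
    """
    v_uncond = v_fn(x, t, None, None)  # both conditions dropped
    v_text = v_fn(x, t, text, None)    # text condition only
    v_skel = v_fn(x, t, None, skel)    # skeleton condition only
    return v_uncond + w_text * (v_text - v_uncond) + w_skel * (v_skel - v_uncond)

def sample(v_fn, shape, text, skel, steps=20, seed=0, **guidance):
    """Integrate dx/dt = v(x, t) from noise (t=0) to data (t=1) with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)     # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * guided_velocity(v_fn, x, t, text, skel, **guidance)
    return x

def toy_velocity(x, t, text, skel):
    """Hypothetical stand-in for the trained network.

    Each active condition shifts a constant target; the field points
    along the straight line from the current x to that target.
    """
    target = np.zeros_like(x)
    if text is not None:
        target = target + text
    if skel is not None:
        target = target + skel
    return (target - x) / (1.0 - t)

# Shape is (frames, joints, features), echoing per-joint tokenization.
out = sample(toy_velocity, (4, 3, 2), text=1.0, skel=2.0)
```

In this sketch, editing would correspond to changing `text` while holding `skel` fixed, and retargeting to changing `skel` while holding `text` fixed; the two guidance weights trade off text adherence against skeletal conformity.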

Architecture Design (Transformers, SSMs, MoE) | Computer Vision | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
