Mar 16, 2026arXiv:2603.15415

AnyCrowd: Instance-Isolated Identity-Pose Binding for Arbitrary Multi-Character Animation

Zhenyu Xie, Ji Xia, Michael Kampffmeyer, Panwen Hu, Zehua Ma, Yujian Zheng, Jing Wang, Zheng Chong, Xujie Zhang, Xianhang Cheng, Xiaodan Liang, Hao Li

AI Summary

The paper introduces AnyCrowd, a Diffusion Transformer (DiT)-based video generation framework designed for scalable multi-character animation. To prevent identity entanglement, they propose an Instance-Isolated Latent Representation (IILR) that encodes character instances independently. They further introduce Tri-Stage Decoupled Attention (TSDA) and Adaptive Gated Fusion (AGF) to bind identities to driving poses while mitigating token ambiguity in overlapping regions, leading to improved identity-pose binding and consistency.

Key Contribution

Achieve controllable multi-character animation with arbitrary numbers of characters by preventing identity entanglement and improving identity-pose binding via instance-isolated latent representations and decoupled attention.

Abstract

Controllable character animation has advanced rapidly in recent years, yet multi-character animation remains underexplored. As the number of characters grows, multi-character reference encoding becomes more susceptible to latent identity entanglement, resulting in identity bleeding and reduced controllability. Moreover, learning precise and spatio-temporally consistent correspondences between reference identities and driving pose sequences becomes increasingly challenging, often leading to identity-pose mis-binding and inconsistency in generated videos. To address these challenges, we propose AnyCrowd, a Diffusion Transformer (DiT)-based video generation framework capable of scaling to an arbitrary number of characters. Specifically, we first introduce an Instance-Isolated Latent Representation (IILR), which encodes character instances independently prior to DiT processing to prevent latent identity entanglement. Building on this disentangled representation, we further propose Tri-Stage Decoupled Attention (TSDA) to bind identities to driving poses by decomposing self-attention into: (i) instance-aware foreground attention, (ii) background-centric interaction, and (iii) global foreground-background coordination. Furthermore, to mitigate token ambiguity in overlapping regions, an Adaptive Gated Fusion (AGF) module is integrated within TSDA to predict identity-aware weights, effectively fusing competing token groups into identity-consistent representations...

Computer Vision Robotics & Embodied AI

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

AnyCrowd: Instance-Isolated Identity-Pose Binding for Arbitrary Multi-Character Animation

Related Papers