The paper introduces AnyCrowd, a Diffusion Transformer (DiT)-based video generation framework designed for scalable multi-character animation. To prevent identity entanglement, the authors propose an Instance-Isolated Latent Representation (IILR) that encodes each character instance independently. They further introduce Tri-Stage Decoupled Attention (TSDA), which binds identities to driving poses, and Adaptive Gated Fusion (AGF), which mitigates token ambiguity in overlapping regions, improving identity-pose binding and consistency.
Achieve controllable multi-character animation for an arbitrary number of characters by preventing identity entanglement and improving identity-pose binding through instance-isolated latent representations and decoupled attention.
Controllable character animation has advanced rapidly in recent years, yet multi-character animation remains underexplored. As the number of characters grows, multi-character reference encoding becomes more susceptible to latent identity entanglement, resulting in identity bleeding and reduced controllability. Moreover, learning precise and spatio-temporally consistent correspondences between reference identities and driving pose sequences becomes increasingly challenging, often leading to identity-pose mis-binding and inconsistency in generated videos. To address these challenges, we propose AnyCrowd, a Diffusion Transformer (DiT)-based video generation framework capable of scaling to an arbitrary number of characters. Specifically, we first introduce an Instance-Isolated Latent Representation (IILR), which encodes character instances independently prior to DiT processing to prevent latent identity entanglement. Building on this disentangled representation, we further propose Tri-Stage Decoupled Attention (TSDA) to bind identities to driving poses by decomposing self-attention into: (i) instance-aware foreground attention, (ii) background-centric interaction, and (iii) global foreground-background coordination. Furthermore, to mitigate token ambiguity in overlapping regions, an Adaptive Gated Fusion (AGF) module is integrated within TSDA to predict identity-aware weights, effectively fusing competing token groups into identity-consistent representations...
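The abstract gives no implementation details, so the following is only a minimal NumPy sketch of the two general ideas it names: attention restricted to tokens of the same character instance, and a gate that fuses competing token groups where characters overlap. All function names (`masked_self_attention`, `instance_mask`, `gated_fusion`), the single-head attention without learned projections, and the softmax gate are assumptions made for illustration. In particular, the gate logits are passed in directly here, whereas the paper describes them as predicted by the AGF module. This is not the authors' architecture.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def masked_self_attention(tokens, allowed):
    """Single-head self-attention without learned projections (illustrative).

    tokens:  (N, D) array of token features.
    allowed: (N, N) boolean array; allowed[i, j] is True if token i may
             attend to token j. Every row must allow at least one token.
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    # Disallowed pairs get -inf so they receive zero attention weight.
    scores = np.where(allowed, scores, -np.inf)
    return softmax(scores, axis=-1) @ tokens


def instance_mask(instance_ids):
    """Hypothetical instance-aware mask: tokens attend only within their own instance."""
    ids = np.asarray(instance_ids)
    return ids[:, None] == ids[None, :]


def gated_fusion(candidates, gate_logits):
    """Fuse K competing token groups position by position.

    candidates:  (K, N, D) array, one candidate representation per group.
    gate_logits: (K, N) array of scores; assumed to come from some learned
                 predictor, which is not modeled here.
    """
    weights = softmax(gate_logits, axis=0)  # sums to 1 over the K groups
    return (weights[..., None] * candidates).sum(axis=0)


# Toy example: six tokens, the first three from character 0, the rest from character 1.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))
ids = [0, 0, 0, 1, 1, 1]
isolated = masked_self_attention(tokens, instance_mask(ids))

# Two competing candidate representations for three overlapping positions.
candidates = rng.normal(size=(2, 3, 4))
fused = gated_fusion(candidates, np.zeros((2, 3)))  # equal gates average the groups
```

With the instance mask in place, changing the tokens of one character leaves the attention output for the other character unchanged, which is the isolation property this sketch is meant to show. With equal gate logits the fusion reduces to a plain average, and a strongly skewed gate selects a single group.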