The paper introduces SoulX-LiveAct, a novel autoregressive diffusion model for hour-scale real-time human animation. It addresses two challenges in existing AR diffusion methods by proposing Neighbor Forcing, a diffusion-step-consistent AR formulation, and a structured ConvKV memory mechanism for efficient long-term context handling. Experiments demonstrate significant improvements in training convergence, generation quality, and inference efficiency, achieving state-of-the-art performance with 20 FPS real-time streaming on as few as two NVIDIA H100 or H200 GPUs.
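The paper's exact Neighbor Forcing formulation is not reproduced here; the following is a minimal sketch of the general idea as the summary describes it, with function names of our own choosing: the conditioning neighbor and the target frame are noised at the same diffusion step, so the learning signal is distribution-aligned between training and autoregressive inference.

```python
import numpy as np

def add_noise(x, sigma, rng):
    # Forward diffusion step: corrupt x with Gaussian noise of scale sigma.
    return x + sigma * rng.standard_normal(x.shape)

def neighbor_forcing_pair(prev_frame, cur_frame, sigma, rng):
    """Build one training pair under a shared noise condition (sketch).

    Instead of conditioning on a *clean* previous frame, both the
    temporal neighbor and the target are noised at the SAME diffusion
    step, so the conditioning distribution at train time matches what
    the model sees when consuming its own noisy outputs at inference.
    """
    noisy_neighbor = add_noise(prev_frame, sigma, rng)  # condition
    noisy_target = add_noise(cur_frame, sigma, rng)     # denoising input
    return noisy_neighbor, noisy_target

rng = np.random.default_rng(0)
prev, cur = rng.standard_normal((2, 4, 4))  # two toy 4x4 "frames"
cond, tgt = neighbor_forcing_pair(prev, cur, sigma=0.5, rng=rng)
```

The key contrast with clean-context forcing strategies is that the condition and the target live at the same noise level, which is what the abstract calls a diffusion-step-consistent formulation.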
Forget short looping animations – this new diffusion model generates hour-long, real-time human animations with lip-sync accuracy and emotional expressiveness, all while running on just two GPUs.
Autoregressive (AR) diffusion models offer a promising framework for sequential generation tasks such as video synthesis by combining diffusion modeling with causal generation. Although they support streaming generation, existing AR diffusion methods struggle to scale efficiently. In this paper, we identify two key challenges in hour-scale real-time human animation. First, most forcing strategies propagate sample-level representations with mismatched diffusion states, causing inconsistent learning signals and unstable convergence. Second, historical representations grow unbounded and lack structure, preventing effective reuse of cached states and severely limiting inference efficiency. To address these challenges, we propose Neighbor Forcing, a diffusion-step-consistent AR formulation that propagates temporally adjacent frames as latent neighbors under the same noise condition. This design provides a distribution-aligned and stable learning signal while preventing drift throughout the AR chain. Building upon this, we introduce a structured ConvKV memory mechanism that compresses the keys and values in causal attention into a fixed-length representation, enabling constant-memory inference and truly infinite video generation without relying on short-term motion-frame memory. Extensive experiments demonstrate that our approach significantly improves training convergence, hour-scale generation quality, and inference efficiency compared to existing AR diffusion methods. In practice, SoulX-LiveAct enables hour-scale human animation and supports 20 FPS real-time streaming inference on as few as two NVIDIA H100 or H200 GPUs. Quantitative results demonstrate that our method attains state-of-the-art performance in lip-sync accuracy, human animation quality, and emotional expressiveness, with the lowest inference cost.
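The abstract describes ConvKV only at a high level (compressing causal-attention keys and values into a fixed-length representation). Below is an illustrative sketch, not the paper's implementation: whenever a growing KV cache exceeds a budget, adjacent time steps are pooled with a stride-2 averaging "convolution", so the cache, and hence attention memory, stays constant no matter how long the stream runs. All names and the pooling choice are assumptions.

```python
import numpy as np

def convkv_compress(kv_cache, max_len=16, kernel=2):
    """Keep a (T, d) KV cache at a fixed length (sketch, assumed scheme).

    When the cache grows past max_len, average each pair of adjacent
    time steps (a stride-2 pooling standing in for a learned temporal
    conv), halving the compressed prefix while keeping the newest
    entries untouched. Memory stays O(max_len) regardless of stream length.
    """
    while kv_cache.shape[0] > max_len:
        T = kv_cache.shape[0] - (kv_cache.shape[0] % kernel)
        pooled = kv_cache[:T].reshape(T // kernel, kernel, -1).mean(axis=1)
        kv_cache = np.concatenate([pooled, kv_cache[T:]], axis=0)
    return kv_cache

# Simulate streaming generation: one frame's KV appended per step.
rng = np.random.default_rng(0)
cache = np.empty((0, 8))
for t in range(1000):
    cache = np.concatenate([cache, rng.standard_normal((1, 8))], axis=0)
    cache = convkv_compress(cache, max_len=16)
# After 1000 steps the cache is still bounded by max_len entries.
```

The point of the sketch is the invariant, not the pooling rule: because attention reads a fixed-length cache, per-step inference cost is constant, which is what makes "truly infinite" generation feasible.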