ActionParty is introduced to address the challenge of action binding in video diffusion models, enabling control of multiple agents in generative video games. The model uses subject state tokens and a spatial biasing mechanism to disentangle global video frame rendering from individual action-controlled subject updates. Evaluated on the Melting Pot benchmark, ActionParty demonstrates improved action-following accuracy and identity consistency when controlling up to seven players across diverse environments.
World models can now handle complex multi-agent interactions, controlling up to seven players simultaneously with improved action accuracy and identity consistency.
Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings and fail to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. To this end, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e., latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
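The abstract does not say how the spatial biasing mechanism is implemented. As a rough illustration only, the sketch below shows one way per-subject state tokens could attend to video latent patches with an additive spatial bias, so that each token focuses on the region around its own subject. Everything here is an assumption rather than the paper's method: the function names `spatial_bias` and `biased_attention`, the Gaussian form of the bias, the `sigma` parameter, and the use of explicit (row, column) subject positions are all hypothetical.

```python
import numpy as np

def spatial_bias(positions, height, width, sigma=1.5):
    """Hypothetical additive attention bias, shape (num_subjects, height * width).

    positions: float array (num_subjects, 2) of (row, col) patch coordinates.
    Patches far from a subject get a large negative bias (assumed Gaussian falloff).
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    grid = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(float)  # (H*W, 2)
    sq_dist = ((positions[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    return -sq_dist / (2.0 * sigma ** 2)

def biased_attention(state_tokens, patch_latents, bias):
    """Cross-attention from subject state tokens to video latent patches.

    state_tokens:  (num_subjects, dim), one token per controlled subject.
    patch_latents: (num_patches, dim), flattened video latents for one frame.
    bias:          (num_subjects, num_patches), added to the attention logits.
    Returns the attention weights and the per-subject attended features.
    """
    dim = state_tokens.shape[-1]
    logits = state_tokens @ patch_latents.T / np.sqrt(dim) + bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights, weights @ patch_latents

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    height, width, dim = 4, 5, 8
    positions = np.array([[0.0, 0.0], [3.0, 4.0]])  # two subjects, opposite corners
    tokens = rng.normal(size=(2, dim))
    patches = rng.normal(size=(height * width, dim))
    weights, attended = biased_attention(
        tokens, patches, spatial_bias(positions, height, width)
    )
    print(weights.shape, attended.shape)
```

In this sketch the bias only shapes where each state token reads from. How actions update the tokens, and how the tokens feed back into frame rendering, is left out because the abstract gives no detail on either.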