Snap Research | Apr 2, 2026 | arXiv:2604.02330

ActionParty: Multi-Subject Action Binding in Generative Video Games

Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, Fabio Pizzati, Aliaksandr Siarohin

AI Summary

ActionParty is introduced to address the challenge of action binding in video diffusion models, enabling control of multiple agents in generative video games. The model uses subject state tokens and a spatial biasing mechanism to disentangle global video frame rendering from individual action-controlled subject updates. Evaluated on the Melting Pot benchmark, ActionParty demonstrates improved action-following accuracy and identity consistency when controlling up to seven players across diverse environments.

Key Contribution

World models can now handle complex multi-agent interactions, controlling up to seven players simultaneously with improved action accuracy and identity consistency.

Abstract

Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action-controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e., latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
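
The abstract does not specify how the spatial biasing mechanism is implemented. As a rough illustration only, one common way to bias attention spatially is to add a distance-based term to the attention logits, so that a token tied to a subject attends mostly to video patches near that subject. The sketch below assumes this form; the function names, the Gaussian-style bias, and the `sigma` parameter are all hypothetical and are not taken from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def spatial_bias(subject_xy, patch_xy, sigma=1.0):
    """Hypothetical additive bias: 0 at the subject's position and
    increasingly negative for patches farther away (assumed form)."""
    dx = subject_xy[0] - patch_xy[0]
    dy = subject_xy[1] - patch_xy[1]
    return -(dx * dx + dy * dy) / (2.0 * sigma * sigma)

def biased_attention(query, keys, subject_xy, patch_coords, sigma=1.0):
    """Attention weights of one subject token over video patches.
    Logit = scaled dot product (content) + spatial bias (location)."""
    scale = 1.0 / math.sqrt(len(query))
    logits = []
    for key, xy in zip(keys, patch_coords):
        content = sum(q * k for q, k in zip(query, key)) * scale
        logits.append(content + spatial_bias(subject_xy, xy, sigma))
    return softmax(logits)

# Toy example: a 2x2 grid of patches with identical keys,
# so any difference in the weights comes from the spatial bias alone.
patch_coords = [(0, 0), (1, 0), (0, 1), (1, 1)]
keys = [[1.0, 0.0] for _ in patch_coords]
query = [1.0, 0.0]
weights = biased_attention(query, keys, subject_xy=(0, 0),
                           patch_coords=patch_coords, sigma=0.5)
```

In this toy setup the subject sits at patch (0, 0), so that patch receives the largest weight and the diagonal patch (1, 1) the smallest. How ActionParty actually derives subject locations from its state tokens, and what shape its bias takes, is described in the paper itself rather than in the abstract.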

Computer Vision | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 91

Year: 2026

Venue: N/A
