Feb 25, 2026 arXiv:2602.22208

Solaris: Building a Multiplayer Video World Model in Minecraft

Georgy Savva, Oscar Michel, Daohan Lu, Suppakit Waiwitlikhit, Timothy Meehan, Dhairya Mishra, Srivats Poddar, Jack Lu, Saining Xie

AI Summary

The paper introduces Solaris, a multiplayer video world model trained on Minecraft, designed to simulate consistent multi-view observations, addressing the limitations of existing single-agent video world models. To facilitate this, the authors developed a robust data collection system for multiplayer environments, capturing synchronized videos and actions from multiple agents. They train Solaris using a staged pipeline incorporating bidirectional, causal, and Checkpointed Self Forcing training, achieving superior performance compared to existing baselines in tasks like multiplayer movement and view consistency.

Key Contribution

Solaris lets you simulate consistent multi-view Minecraft observations, opening the door to more realistic and interactive multi-agent world models.

Abstract

Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized capture of videos and actions. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework for multiplayer movement, memory, grounding, building, and view consistency. We train Solaris using a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show our architecture and training design outperform existing baselines. Through open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.

Computer Vision, Data Curation & Synthetic Data, World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 67

Year: 2026

Venue: N/A
