Mar 3, 2026 · arXiv:2603.02697

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

Jianing Zhang, Yiying Yang, Wei Cheng, Xiaoyun Yuan

AI Summary

The paper introduces ShareVerse, a video generation framework for multi-agent shared world modeling. It targets a gap in existing methods, which do not support constructing a single shared environment in which multiple agents interact. The authors build a large-scale multi-agent interactive dataset in CARLA, with diverse scenarios and paired multi-view videos. ShareVerse adds a spatial concatenation strategy for multi-view videos and cross-agent attention blocks to a pretrained video model, enabling consistent shared world modeling and accurate perception of agent positions in 49-frame generated videos.

Key Contribution

A framework that generates consistent multi-agent videos of a shared world by combining spatially concatenated multi-view videos with cross-agent attention.

Abstract

This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) We build a dataset for large-scale multi-agent interactive world modeling on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/rear/left/right views per agent) and camera data. 2) We propose a spatial concatenation strategy for the four-view videos of each independent agent to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.
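The two architectural ideas in the abstract, spatial concatenation of an agent's four views and attention across agents, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the abstract does not specify the concatenation layout (a side-by-side order of front, right, rear, left is assumed here), the token representation, or the internals of the attention blocks. The function names `concat_views` and `cross_agent_attention` are hypothetical, and the attention shown is plain single-head scaled dot-product attention with no learned projections, not the authors' implementation.

```python
import math


def concat_views(front, right, rear, left):
    """Join four per-view videos side by side along the width axis.

    Each video is a nested list indexed as [frame][row][column].
    The view order is an assumption, not taken from the paper.
    """
    views = [front, right, rear, left]
    num_frames, height = len(front), len(front[0])
    if any(len(v) != num_frames or len(v[0]) != height for v in views):
        raise ValueError("all views need the same frame count and height")
    return [
        [front[t][h] + right[t][h] + rear[t][h] + left[t][h]
         for h in range(height)]
        for t in range(num_frames)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_agent_attention(queries, keys, values):
    """Tokens of one agent (queries) attend to tokens of another agent.

    queries: list of d-dimensional vectors from agent A.
    keys, values: lists of vectors from agent B (keys are d-dimensional).
    Returns one attended vector per query (no learned weights, toy only).
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        attended.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return attended


# Toy usage: one frame of 2x2 pixels per view, then attention between
# one token of agent A and two tokens of agent B.
view = lambda base: [[[base, base + 1], [base + 2, base + 3]]]
wide_video = concat_views(view(0), view(10), view(20), view(30))
agent_a_update = cross_agent_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

In this sketch the concatenated video keeps its frame count and height while its width becomes four times that of a single view, and each attended vector is a weighted average of the other agent's value vectors. In a real model the attended output would typically be projected and added back to the querying agent's features.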

Computer Vision · Multimodal Models · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
