Tsinghua AI · Mar 16, 2026 · arXiv:2603.14941

RS-WorldModel: a Unified Model for Remote Sensing Understanding and Future Sense Forecasting

Linrui Xu, Zhongan Wang, Fei Shen, Feiyu Shen, Gang Xu, Huiping Zhuang, Ming Li, Haifeng Li

AI Summary

The paper introduces RS-WorldModel, a unified world model for remote sensing that jointly addresses spatiotemporal change understanding and text-guided future scene forecasting. To train the model, the authors built RSWBench-1.1M, a 1.1-million-sample dataset with rich language annotations, and used a three-stage training process: geo-aware generative pre-training, synergistic instruction tuning, and verifiable reinforcement optimization. The resulting 2B-parameter model outperforms open-source models up to 120 times larger on most spatiotemporal change question-answering metrics and reaches an FID of 43.13 on text-guided future scene forecasting, ahead of all open-source baselines and the closed-source Gemini-2.5-Flash Image.

Key Contribution

A 2B-parameter model trained on a new 1.1-million-sample dataset forecasts remote sensing scenes with a better FID than Gemini-2.5-Flash Image, suggesting that task-specific training data and methods can beat sheer scale.

Abstract

Remote sensing world models aim to both explain observed changes and forecast plausible futures, two tasks that share spatiotemporal priors. Existing methods, however, typically address them separately, limiting cross-task transfer. We present RS-WorldModel, a unified world model for remote sensing that jointly handles spatiotemporal change understanding and text-guided future scene forecasting, and we build RSWBench-1.1M, a 1.1 million sample dataset with rich language annotations covering both tasks. RS-WorldModel is trained in three stages: (1) Geo-Aware Generative Pre-training (GAGP) conditions forecasting on geographic and acquisition metadata; (2) synergistic instruction tuning (SIT) jointly trains understanding and forecasting; (3) verifiable reinforcement optimization (VRO) refines outputs with verifiable, task-specific rewards. With only 2B parameters, RS-WorldModel surpasses open-source models up to 120$\times$ larger on most spatiotemporal change question-answering metrics. It achieves an FID of 43.13 on text-guided future scene forecasting, outperforming all open-source baselines as well as the closed-source Gemini-2.5-Flash Image (Nano Banana).
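For readers unfamiliar with the FID figure quoted above, the following is a minimal sketch of the standard Fréchet distance between two sets of image features (lower is better). It is not the authors' evaluation code: the abstract does not say which feature extractor or how many samples were used, so the function name `frechet_distance` and its inputs (precomputed feature arrays, conventionally Inception-v3 pooling features) are assumptions made for illustration.

```python
import numpy as np


def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim). The features are
    assumed to be precomputed by some image encoder (hypothetical here).
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    # atleast_2d keeps the matrix product valid when feature_dim == 1.
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g. For positive semi-definite inputs these
    # eigenvalues are real and non-negative up to numerical noise, so the
    # real part is taken and clipped at zero.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    fid = diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt
    return float(fid)
```

Two identical feature sets give a distance of approximately zero, and shifting every feature by a constant adds the squared length of that shift, which makes the function easy to sanity-check before using it on real features.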

Computer Vision · Multimodal Models · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 62

Year: 2026

Venue: N/A
