KIT, Shenzhen University. Mar 10, 2026. arXiv:2603.09573

More than the Sum: Panorama-Language Models for Adverse Omni-Scenes

Weijia Fan, Ruiping Liu, Jiale Wei, Yufan Chen, Junwei Zheng, Zichao Zeng, Jiaming Zhang, Qiufu Li, Linlin Shen, Rainer Stiefelhagen

AI Summary

This paper introduces Panorama-Language Modeling (PLM), a new paradigm for 360° vision-language reasoning that leverages the holistic spatial and contextual relationships inherent in panoramic images. To facilitate this, the authors created PanoVQA, a large-scale panoramic VQA dataset focused on adverse omni-scenes with occlusions and driving accidents. They also developed a plug-and-play panoramic sparse attention module, enabling existing VLMs to process equirectangular panoramas without retraining, and demonstrated improved robustness and holistic reasoning compared to pinhole-based approaches.

Key Contribution

Panorama-language models that reason over a single 360° panorama show more robust and more holistic scene understanding in adverse conditions than pinhole-based VLMs that stitch together multiple narrow views.

Abstract

Existing vision-language models (VLMs) are tailored for pinhole imagery, stitching multiple narrow field-of-view inputs to piece together a complete omni-scene understanding. Yet, such multi-view perception overlooks the holistic spatial and contextual relationships that a single panorama inherently preserves. In this work, we introduce the Panorama-Language Modeling (PLM) paradigm, a unified $360^\circ$ vision-language reasoning approach that is more than the sum of its pinhole counterparts. We also present PanoVQA, a large-scale panoramic VQA dataset that involves adverse omni-scenes, enabling comprehensive reasoning under object occlusions and driving accidents. To establish a foundation for PLM, we develop a plug-and-play panoramic sparse attention module that allows existing pinhole-based VLMs to process equirectangular panoramas without retraining. Extensive experiments demonstrate that our PLM achieves superior robustness and holistic reasoning under challenging omni-scenes, yielding understanding greater than the sum of its narrow parts. Project page: https://github.com/InSAI-Lab/PanoVQA.
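
The abstract does not say how the panoramic sparse attention module is built, so the following is only an illustrative sketch and not the authors' implementation. It shows one geometric property that any attention pattern over an equirectangular panorama could exploit: the left and right image edges are adjacent on the sphere, so horizontal distance between patches should be measured circularly. The function name `wraparound_local_mask`, the local-window form of sparsity, and the `radius` parameter are all assumptions made for this example.

```python
import numpy as np

def wraparound_local_mask(rows: int, cols: int, radius: int) -> np.ndarray:
    """Hypothetical local attention mask over a rows x cols patch grid.

    Returns a boolean (rows*cols, rows*cols) array that is True where
    token i may attend to token j. Columns wrap around (longitude is
    periodic in an equirectangular image); rows do not (the poles).
    """
    r = np.repeat(np.arange(rows), cols)  # row index of each flattened token
    c = np.tile(np.arange(cols), rows)    # column index of each flattened token
    dr = np.abs(r[:, None] - r[None, :])  # vertical distance, no wrap
    dc = np.abs(c[:, None] - c[None, :])
    dc = np.minimum(dc, cols - dc)        # circular horizontal distance
    return (dr <= radius) & (dc <= radius)

mask = wraparound_local_mask(rows=4, cols=8, radius=1)
# The leftmost patch of row 0 (token 0) can see the rightmost one (token 7).
print(bool(mask[0, 7]))
```

A mask like this could be passed as a boolean attention mask to a standard attention layer; whether the paper's module uses a local window, a different sparsity pattern, or something else entirely is not stated in the abstract.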

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
