Apr 1, 2026 | arXiv:2604.00912

ProCap: Projection-Aware Captioning for Spatial Augmented Reality

Zimo Cao, Yuchen Deng, Haibin Ling, Bingyao Huang

AI Summary

The paper introduces ProCap, a framework designed to improve VLMs' understanding of scenes in spatial augmented reality (SAR) by explicitly decoupling projected content from the physical scene. ProCap uses a two-stage pipeline: automated segmentation first isolates the virtual and physical layers, and region-aware retrieval then mitigates the semantic ambiguity caused by projection distortion. The authors also introduce RGBP, a large-scale SAR semantic benchmark dataset with dense annotations, and a dual-captioning evaluation protocol that assesses physical scene and projection descriptions independently.

Key Contribution

VLMs get confused by spatial augmented reality, but ProCap's two-stage decoupling pipeline and new RGBP dataset could finally let them tell the difference between real and projected objects.

Abstract

Spatial augmented reality (SAR) directly projects digital content onto physical scenes using projectors, creating immersive experiences without head-mounted displays. However, for SAR to support intelligent interaction, such as reasoning about the scene or answering user queries, it must semantically distinguish between the physical scene and the projected content. Standard Vision Language Models (VLMs) struggle with this virtual-physical ambiguity, often confusing the two contexts. To address this issue, we introduce ProCap, a novel framework that explicitly decouples projected content from physical scenes. ProCap employs a two-stage pipeline: first, it visually isolates virtual and physical layers via automated segmentation; then, it uses region-aware retrieval to avoid ambiguous semantic context due to projection distortion. To support this, we present RGBP (RGB + Projections), the first large-scale SAR semantic benchmark dataset, featuring 65 diverse physical scenes and over 180,000 projections with dense, decoupled annotations. Finally, we establish a dual-captioning evaluation protocol using task-specific tokens to assess physical scene and projection descriptions independently. Our experiments show that ProCap provides a robust semantic foundation for future SAR research. The source code, pre-trained models and the RGBP dataset are available on the project page: https://ZimoCao.github.io/ProCap/.
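
To make the decoupling idea concrete, here is a minimal sketch and not ProCap's actual implementation. It assumes a binary projection mask is already available (in the paper this would come from the automated segmentation stage), splits an image into a physical layer and a projection layer, and builds one prompt per captioning task. The function names `split_layers` and `build_prompts` and the token strings `<scene>` and `<projection>` are hypothetical; the abstract mentions task-specific tokens but does not name them. The region-aware retrieval stage is not covered.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Mask = List[List[bool]]

# Hypothetical task tokens: placeholders chosen for illustration only.
SCENE_TOKEN = "<scene>"
PROJECTION_TOKEN = "<projection>"


def split_layers(
    image: Image, projection_mask: Mask, fill: Pixel = (0, 0, 0)
) -> Tuple[Image, Image]:
    """Split an image into (physical_layer, projection_layer).

    Pixels where the mask is True are treated as projected content.
    Each layer keeps its own pixels and replaces the rest with `fill`.
    """
    physical: Image = []
    projected: Image = []
    for row, mask_row in zip(image, projection_mask):
        physical.append([fill if m else px for px, m in zip(row, mask_row)])
        projected.append([px if m else fill for px, m in zip(row, mask_row)])
    return physical, projected


def build_prompts(instruction: str = "Describe this image.") -> Dict[str, str]:
    """Build one prompt per task so the two captions can be scored separately."""
    return {
        "scene": f"{SCENE_TOKEN} {instruction}",
        "projection": f"{PROJECTION_TOKEN} {instruction}",
    }
```

In this sketch each layer would be captioned with its own prompt, which mirrors the dual-captioning idea of evaluating the physical scene description and the projection description independently.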

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 68

Year: 2026

Venue: N/A
