HKU | JHU | Northeastern | Rice | Apr 2, 2026 | arXiv:2604.02323

Beyond Referring Expressions: Scenario Comprehension Visual Grounding

Ruozhen He, Nisarg A. Shah, Qihua Dong, Zilin Xiao, Jaywon Koo, Vicente Ordonez

AI Summary

The paper introduces Referring Scenario Comprehension (RSC), a new visual grounding benchmark that requires models to infer targets from scenario descriptions involving roles, intentions, and relational context, rather than relying on explicit naming. RSC includes interpretable difficulty tags that expose model failure modes, along with an out-of-distribution split containing unseen object categories. The authors also propose ScenGround, a curriculum reasoning method that combines supervised warm-starting with difficulty-aware reinforcement learning, and report improved performance on challenging RSC slices as well as transfer to standard benchmarks.

Key Contribution

Current visual grounding models struggle to infer objects from contextual roles and intentions, highlighting a critical gap in their ability to perform true scene understanding.

Abstract

Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.
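To make the idea of fine-grained analysis over difficulty tags concrete, the following is a minimal sketch of slicing grounding accuracy by tag. It assumes the common visual grounding convention of counting a prediction as correct when its box IoU with the ground truth is at least 0.5; the abstract does not state which metric RSC uses. The record layout (`pred_box`, `gt_box`, `tags`), the tag values such as `"high"` or `"small"`, and the function names are hypothetical and are not taken from the paper or its released data.

```python
from collections import defaultdict


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_by_tag(examples, threshold=0.5):
    """Return {(tag_name, tag_value): accuracy} for a list of example dicts.

    The threshold of 0.5 is an assumed convention, not a value from the paper.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ex in examples:
        correct = iou(ex["pred_box"], ex["gt_box"]) >= threshold
        # Each example contributes once to every tag slice it belongs to.
        for name, value in ex["tags"].items():
            totals[(name, value)] += 1
            hits[(name, value)] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}


# Hypothetical records: tag names follow the abstract, tag values are invented.
examples = [
    {"pred_box": (0, 0, 10, 10), "gt_box": (0, 0, 10, 10),
     "tags": {"clutter": "high", "size": "small"}},
    {"pred_box": (0, 0, 10, 10), "gt_box": (20, 20, 30, 30),
     "tags": {"clutter": "high", "size": "large"}},
]
report = accuracy_by_tag(examples)
```

Reporting one accuracy per (tag, value) pair, rather than a single aggregate number, is what lets a slice such as high clutter or small targets surface as a distinct failure mode.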

Computer Vision, Eval Frameworks & Benchmarks, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 39

Year: 2026

Venue: N/A
