The paper introduces Referring Scenario Comprehension (RSC), a new visual grounding benchmark that requires models to infer targets from scenario descriptions involving roles, intentions, and relational context, rather than relying on explicit naming. RSC includes interpretable difficulty tags that expose model failure modes, and an out-of-distribution split with unseen object categories. The authors also propose ScenGround, a curriculum reasoning method that combines supervised warm-starting with difficulty-aware reinforcement learning; it improves performance on challenging RSC slices and transfers to standard benchmarks.
Current visual grounding models struggle to infer objects from contextual roles and intentions, highlighting a critical gap in their ability to perform true scene understanding.
Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position, which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k-example out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.
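As a rough illustration of how per-instance difficulty tags can support fine-grained analysis, the sketch below groups grounding accuracy by tag value. It is not the RSC data format or the authors' evaluation code: the `Instance` fields, the tag levels such as "high" and "small", and the IoU threshold of 0.5 (a common convention in visual grounding) are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Axis-aligned box as (x1, y1, x2, y2). Hypothetical representation.
Box = tuple[float, float, float, float]


@dataclass
class Instance:
    """Hypothetical record: a scenario query, its target box, and difficulty tags."""
    query: str
    target: Box
    # Tag name -> level, e.g. {"clutter": "high"}. Level names are assumed.
    tags: dict[str, str] = field(default_factory=dict)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_by_tag(instances, predictions, threshold=0.5):
    """Return accuracy for each (tag name, level) slice.

    A prediction counts as correct when its IoU with the target box
    reaches the threshold (assumed metric, not taken from the paper).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for inst, pred in zip(instances, predictions):
        correct = iou(inst.target, pred) >= threshold
        for name, level in inst.tags.items():
            totals[(name, level)] += 1
            hits[(name, level)] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}
```

Reporting one number per (tag, level) slice rather than a single aggregate is what lets a failure mode such as "small targets in cluttered scenes" show up separately from overall accuracy.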