This paper addresses the VRR-QA challenge with a test-time reasoning pipeline that pairs a strong GPT-5.5 video QA solver with question-aware evidence ledgers. The ledgers are routed by question type and make spatial relations, event boundaries, and dialogue context explicit, and a conservative gate changes the solver's answer only when independent evidence uniquely supports a different option. The resulting evidence-gated pipeline achieves 92.95% overall accuracy and 93.79% macro accuracy on the challenge test split.
With nearly 93% overall accuracy on the VRR-QA test split, this approach shows how structured, question-aware evidence can support relational reasoning over video.
The VRR-QA challenge evaluates visual relational reasoning in videos, where answers often depend on implicit spatial relations, event boundaries, target identity, and dialogue context rather than a single salient frame. We present a test-time reasoning pipeline built around a strong GPT-5.5 video QA solver and a set of question-aware evidence ledgers. The initial solver answers each question from a uniform video representation, while routed ledgers are prompted to make the required targets, count units, reference frames, and temporal or spatial scope explicit for counting, spatial, endpoint, viewpoint, and dialogue reasoning. External tools such as open-vocabulary detection, depth cues, pair crops, ASR, and scene-graph ledgers are used only as evidence sources. A conservative gate keeps the current answer unless independent evidence uniquely supports a different option. The final evidence-gated pipeline achieves 92.95% overall accuracy and 93.79% macro accuracy on the challenge test split.
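The conservative gate described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `evidence_votes` mapping (source name to supported option, or `None` when a source is inconclusive), and the `min_sources` threshold are all hypothetical. "Uniquely supports" is read here as "every conclusive source agrees on the same single option".

```python
from typing import Dict, Optional


def unique_supported_option(evidence_votes: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the one option backed by the evidence, or None.

    evidence_votes maps a hypothetical source name (e.g. "detector", "asr")
    to the answer option it supports, or None if the source is inconclusive.
    """
    supported = {opt for opt in evidence_votes.values() if opt is not None}
    # Evidence is "unique" only when all conclusive sources name the same option.
    return next(iter(supported)) if len(supported) == 1 else None


def conservative_gate(
    current: str,
    evidence_votes: Dict[str, Optional[str]],
    min_sources: int = 1,  # assumed threshold, not taken from the paper
) -> str:
    """Keep the current answer unless evidence uniquely backs another option."""
    candidate = unique_supported_option(evidence_votes)
    if candidate is None or candidate == current:
        return current
    backers = sum(1 for opt in evidence_votes.values() if opt == candidate)
    return candidate if backers >= min_sources else current
```

For example, `conservative_gate("A", {"detector": "B", "asr": None})` switches to `"B"`, while conflicting evidence such as `{"detector": "B", "asr": "C"}` leaves the answer at `"A"`. Defaulting to the solver's answer means that noisy or contradictory tool output cannot overturn it.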