This paper introduces UltraVR, a diagnostic benchmark for evaluating vision-language models (VLMs) on ultra-resolution images, where critical evidence may be tiny, subtle, or spatially distant. Each instance carries a structured ground-truth chain of thought with step-level reasoning labels, which supports process-level analysis across four scenarios: CCTV surveillance, remote sensing, whole-slide image pathology, and industrial anomaly detection. The evaluation shows that current VLMs fail mainly at evidence grounding and local perception, while downstream inference often recovers when the intermediate visual facts are supplied.
Current vision-language models falter in ultra-resolution reasoning, with errors primarily stemming from evidence grounding and local perception.
Vision-language models (VLMs) excel on visual question answering and multimodal reasoning benchmarks. Yet their capability on ultra-resolution images, where critical evidence is tiny, subtle, spatially distant, or distributed, remains unclear. Existing evaluations largely report final-answer accuracy, offering limited insight into whether models acquire and integrate the necessary visual evidence. We introduce UltraVR, a diagnostic benchmark for evidence-grounded visual reasoning over ultra-resolution images. UltraVR spans four high-value scenarios: CCTV surveillance, remote sensing (RS), whole-slide image (WSI) pathology, and industrial anomaly detection (AD). These domains pose complementary challenges: fine-grained object grounding in crowded CCTV scenes, long-range spatial comparison in RS, multi-scale evidence navigation in WSI, and subtle irregularity detection in repetitive industrial layouts. Beyond standard QA triples, each instance includes a structured ground-truth chain of thought with step-level questions, intermediate answers, and reasoning labels. These labels decompose reasoning into evidence grounding, local perception, quantification, evidence integration, and decision inference, enabling process-level diagnosis rather than black-box scoring. Using UltraVR, we evaluate frontier VLMs and show that current models remain far from reliable on ultra-resolution reasoning. Importantly, the structured annotations allow us to localize failures across the visual-to-decision pipeline: errors concentrate in evidence grounding and local perception, while downstream inference often recovers when intermediate visual facts are supplied. These findings establish UltraVR as a diagnostic testbed for measuring not only whether VLMs answer correctly, but also where their ultra-resolution reasoning process breaks.
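To make the step-level annotation concrete, the following is a minimal sketch of how one instance and a per-label failure count might be represented. It is not the benchmark's released format: the class and field names (`Instance`, `Step`, `first_failure`, `failure_histogram`), the exact-match comparison of step answers, and the toy example are all assumptions made for illustration. Only the five reasoning labels and the four scenario abbreviations come from the abstract.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ReasoningLabel(Enum):
    # The five step types named in the abstract.
    EVIDENCE_GROUNDING = "evidence_grounding"
    LOCAL_PERCEPTION = "local_perception"
    QUANTIFICATION = "quantification"
    EVIDENCE_INTEGRATION = "evidence_integration"
    DECISION_INFERENCE = "decision_inference"


@dataclass
class Step:
    # One step of the ground-truth chain of thought (hypothetical schema).
    question: str
    answer: str
    label: ReasoningLabel


@dataclass
class Instance:
    # A QA triple plus its annotated reasoning chain (hypothetical schema).
    scenario: str  # "CCTV", "RS", "WSI", or "AD"
    question: str
    answer: str
    steps: list[Step]


def _norm(text: str) -> str:
    return text.strip().lower()


def first_failure(instance: Instance, predicted: list[str]) -> Optional[ReasoningLabel]:
    """Return the label of the first step the model gets wrong, or None.

    Assumption: step answers are compared by normalized exact match.
    """
    for step, pred in zip(instance.steps, predicted):
        if _norm(pred) != _norm(step.answer):
            return step.label
    return None


def failure_histogram(instances: list[Instance], predictions: list[list[str]]) -> Counter:
    """Count, per reasoning label, how often it is the first point of failure."""
    counts: Counter = Counter()
    for inst, preds in zip(instances, predictions):
        label = first_failure(inst, preds)
        if label is not None:
            counts[label] += 1
    return counts


# Toy data invented for illustration; not drawn from the benchmark.
toy = Instance(
    scenario="AD",
    question="Is any component on the board defective?",
    answer="yes",
    steps=[
        Step("Which region contains the irregular component?", "top-left",
             ReasoningLabel.EVIDENCE_GROUNDING),
        Step("What is wrong with it?", "bent pin",
             ReasoningLabel.LOCAL_PERCEPTION),
        Step("Is the board defective?", "yes",
             ReasoningLabel.DECISION_INFERENCE),
    ],
)
model_steps = ["top-left", "missing label", "yes"]
hist = failure_histogram([toy], [model_steps])
```

In this toy case the final answer matches the ground truth even though the local perception step is wrong, which is the kind of case that final-answer accuracy alone would not surface.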