This paper introduces OMIBench, a benchmark for evaluating Olympiad-level reasoning in large vision-language models (LVLMs) on problems whose evidence is spread across multiple images. The benchmark draws problems from biology, chemistry, mathematics, and physics Olympiads and provides manually annotated rationales along with evaluation protocols for exact and semantic answer matching. Experiments reveal substantial performance gaps: even top models such as Gemini-3-Pro reach only about 50%, highlighting the need for better multi-image reasoning in LVLMs.
Even the best large vision-language models struggle with multi-image reasoning, scoring only about 50% on a new Olympiad-level benchmark designed to test it.
Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.