The paper introduces OddGridBench, a benchmark to evaluate MLLMs' sensitivity to fine-grained visual discrepancies within grid-based images where one element differs in attributes like color, size, or position. Experiments show that current MLLMs, including state-of-the-art models like Gemini and GPT-5, perform significantly worse than humans at detecting these subtle visual differences. To address this, the authors propose OddGrid-GRPO, a reinforcement learning framework with curriculum learning and distance-aware rewards, which substantially improves MLLMs' fine-grained visual discrimination.
Even the most advanced MLLMs like GPT-5 and Gemini struggle to spot the "odd one out" in simple visual grids, revealing a surprising weakness in fine-grained visual perception.
Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision-language tasks. However, their capability for low-level visual perception, particularly the detection of fine-grained visual discrepancies, remains underexplored and lacks systematic analysis. In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images, in which a single element differs from all others in one or more visual attributes such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5 as well as proprietary systems like Gemini-2.5-Pro and GPT-5, perform far below human level in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework that integrates curriculum learning and a distance-aware reward. By progressively controlling the difficulty of training samples and incorporating spatial proximity constraints into the reward design, OddGrid-GRPO significantly enhances models' fine-grained visual discrimination ability. We hope OddGridBench and OddGrid-GRPO will lay the groundwork for advancing perceptual grounding and visual discrepancy sensitivity in multimodal intelligence. Code and dataset are available at https://wwwtttjjj.github.io/OddGridBench/.
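The abstract does not spell out the distance-aware reward, but the idea of "incorporating spatial proximity constraints into the reward design" can be sketched as follows. This is a minimal illustration under assumed conventions (cells addressed as `(row, col)`, a Gaussian decay with a hypothetical `sigma` parameter), not the paper's actual implementation:

```python
import math

def distance_aware_reward(pred, target, sigma=1.0):
    """Hypothetical distance-aware reward for odd-one-out grids.

    Returns 1.0 when the predicted cell matches the true odd cell
    exactly, and partial credit that decays with the Euclidean
    distance between the predicted and true (row, col) positions.
    The Gaussian shape and sigma value are illustrative assumptions.
    """
    pr, pc = pred
    tr, tc = target
    dist = math.hypot(pr - tr, pc - tc)
    # Nearby guesses earn partial credit; distant ones approach zero.
    return math.exp(-(dist ** 2) / (2 * sigma ** 2))

# An exact hit scores 1.0; an adjacent cell scores exp(-0.5) ~= 0.61.
print(distance_aware_reward((2, 2), (2, 2)))
print(distance_aware_reward((2, 3), (2, 2)))
```

A reward shaped this way gives the policy a gradient toward the correct region of the grid even when its guess is wrong, which a binary hit/miss reward would not.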