The paper introduces Multimodal and Multidimensional Item Response Theory (M3IRT), which decomposes model ability and item difficulty in multimodal benchmarks into image-only, text-only, and cross-modal components. M3IRT identifies and prioritizes genuinely cross-modal questions, filtering out shortcut questions that can be solved using a single modality. Experiments across 24 VLMs on three benchmarks show that M3IRT preserves ranking fidelity even when 50% of items are artificially generated low-quality questions, enabling more efficient and reliable evaluation of cross-modal reasoning.
Current multimodal benchmarks are full of single-modality shortcuts, but this paper offers a way to prune them, yielding more reliable and efficient evaluations of true cross-modal reasoning.
Multimodal Large Language Models (MLLMs) have recently emerged as general architectures capable of reasoning over diverse modalities. Benchmarks for MLLMs should measure their ability to integrate information across modalities. However, current benchmarks are filled with shortcut questions, which can be solved using only a single modality, thereby yielding unreliable rankings. For example, in vision-language cases, the correct answer can be found without the image or without the text. These low-quality questions unnecessarily increase the size and computational requirements of benchmarks. We introduce a multimodal and multidimensional item response theory framework (M3IRT) that extends classical IRT by decomposing both model ability and item difficulty into image-only, text-only, and cross-modal components. M3IRT estimates the cross-modal ability of MLLMs and each question's cross-modal difficulty, enabling compact, high-quality subsets that better reflect multimodal reasoning. Across 24 VLMs on three benchmarks, M3IRT prioritizes genuinely cross-modal questions over shortcuts and preserves ranking fidelity even when 50% of items are artificially generated low-quality questions, thereby reducing evaluation cost while improving reliability. M3IRT thus offers a practical tool for assessing cross-modal reasoning and refining multimodal benchmarks.
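To make the decomposition concrete, here is a minimal Python sketch of a three-component item response model. The abstract does not give M3IRT's exact parameterization, so everything here is an assumption made for illustration: the compensatory logistic form, the per-component loadings, and the names `p_correct`, `cross_modal_share`, and `select_cross_modal`. It is not the authors' implementation, and the toy values are not results from the paper.

```python
import math

# The three components named in the abstract.
COMPONENTS = ("image", "text", "cross")


def p_correct(ability, load, difficulty):
    """Probability that a model answers an item correctly.

    Assumed form (not taken from the paper): each component contributes
    load * (ability - difficulty) to the logit, as in a compensatory
    multidimensional IRT model.
    """
    logit = sum(load[c] * (ability[c] - difficulty[c]) for c in COMPONENTS)
    return 1.0 / (1.0 + math.exp(-logit))


def cross_modal_share(load):
    """Fraction of an item's total loading carried by the cross-modal component."""
    total = sum(abs(load[c]) for c in COMPONENTS)
    return abs(load["cross"]) / total if total > 0 else 0.0


def select_cross_modal(items, k):
    """Return the names of the k items with the highest cross-modal share."""
    ranked = sorted(
        items,
        key=lambda name: cross_modal_share(items[name]["load"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy parameters, chosen by hand for illustration only.
zero = {"image": 0.0, "text": 0.0, "cross": 0.0}
items = {
    # A shortcut item: almost all of its loading is on the text component.
    "q_shortcut": {"load": {"image": 0.1, "text": 1.5, "cross": 0.1}, "difficulty": zero},
    # A genuinely cross-modal item: loading concentrated on the cross component.
    "q_cross": {"load": {"image": 0.2, "text": 0.2, "cross": 1.6}, "difficulty": zero},
}

# A model that is strong on text alone but weak at cross-modal integration.
text_strong_model = {"image": 0.5, "text": 1.0, "cross": -0.5}

p_shortcut = p_correct(text_strong_model, items["q_shortcut"]["load"], zero)
p_cross = p_correct(text_strong_model, items["q_cross"]["load"], zero)
subset = select_cross_modal(items, k=1)
```

In this toy setup the text-strong model is more likely to answer the shortcut item than the cross-modal one, which illustrates how shortcut questions can inflate a model's apparent ability. Keeping only the items with a high cross-modal share is one simple way to build the kind of compact subset the abstract describes. In practice the parameters would be estimated from observed model responses rather than set by hand.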