The paper introduces MMR-Bench, a benchmark designed to evaluate query-level model selection (routing) for multimodal large language models (MLLMs) under varying compute budgets. MMR-Bench includes modality-aware inputs, a diverse set of vision-language tasks (OCR, VQA, multimodal math), and reference baselines to facilitate standardized, budget-aware evaluation of MLLM routing policies. Experiments on MMR-Bench demonstrate that incorporating multimodal signals improves routing quality, yielding better cost-accuracy trade-offs and stronger generalization to new datasets than using any single MLLM.
Stop wasting compute: MMR-Bench shows you can beat the best single multimodal LLM's accuracy at roughly a third of its cost by intelligently routing queries.
Multimodal large language models (MLLMs) have advanced rapidly, yet heterogeneity in architecture, alignment strategies, and efficiency means that no single model is uniformly superior across tasks. In practical deployments, workloads span lightweight OCR to complex multimodal reasoning; using one MLLM for all queries either over-provisions compute on easy instances or sacrifices accuracy on hard ones. Query-level model selection (routing) addresses this tension, but extending routing from text-only LLMs to MLLMs is nontrivial due to modality fusion, wide variation in computational cost across models, and the absence of a standardized, budget-aware evaluation. We present MMR-Bench, a unified benchmark that isolates the multimodal routing problem and enables comparison under fixed candidate sets and cost models. MMR-Bench provides (i) a controlled environment with modality-aware inputs and variable compute budgets, (ii) a broad suite of vision-language tasks covering OCR, general VQA, and multimodal math reasoning, and (iii) strong single-model references, oracle upper bounds, and representative routing policies. Using MMR-Bench, we show that incorporating multimodal signals improves routing quality: these cues improve the cost-accuracy frontier and enable the routed system to exceed the strongest single model's accuracy at roughly 33% of its cost. Furthermore, policies trained on a subset of models and tasks generalize zero-shot to new datasets and text-only benchmarks without retuning, establishing MMR-Bench as a foundation for studying adaptive multimodal model selection and efficient MLLM deployment. The code will be available at: https://github.com/Hunter-Wrynn/MMR-Bench.
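To make the budget-aware routing setup concrete, here is a minimal sketch of a query-level router of the kind the benchmark evaluates: given a fixed candidate set with per-model costs and a (learned) per-query accuracy predictor, pick the most promising model that fits the budget. This is not the paper's method; all model names, costs, and accuracy numbers below are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    cost: float           # per-query compute cost in arbitrary units (hypothetical)
    pred_accuracy: float  # router's predicted probability of a correct answer

def route(candidates: list[Candidate], budget: float) -> Candidate:
    """Select the highest-predicted-accuracy model whose cost fits the budget;
    break ties toward lower cost. Fall back to the cheapest model if nothing fits."""
    affordable = [c for c in candidates if c.cost <= budget]
    if not affordable:
        return min(candidates, key=lambda c: c.cost)
    return max(affordable, key=lambda c: (c.pred_accuracy, -c.cost))

# Hypothetical candidate pool spanning a cost-accuracy range.
pool = [
    Candidate("small-mllm", cost=1.0, pred_accuracy=0.62),
    Candidate("mid-mllm",   cost=3.0, pred_accuracy=0.74),
    Candidate("large-mllm", cost=9.0, pred_accuracy=0.81),
]

print(route(pool, budget=4.0).name)  # mid-mllm: best accuracy within budget
```

Sweeping the budget and recording accuracy versus realized cost traces out the cost-accuracy frontier the abstract refers to; a multimodal-aware predictor shifts that frontier upward by making better per-query choices.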