MIT CSAIL | Apr 20, 2026 | arXiv:2604.18584

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Shaden Alshammari, Kevin Wen, Abrar Zainal, Mark Hamilton, Navid Safaei, Sultan Albarakati, William T. Freeman, Antonio Torralba

AI Summary

MathNet is a large-scale, multimodal, multilingual dataset of 30,676 Olympiad-level math problems from 47 countries in 17 languages, designed to benchmark mathematical reasoning and retrieval. It includes a retrieval benchmark of mathematically equivalent problems, which enables evaluation of retrieval-augmented problem solving. Experiments show that even state-of-the-art models such as Gemini-3.1-Pro (78.4%) and GPT-5 (69.3%) remain challenged. Retrieval-augmented generation is highly sensitive to retrieval quality: DeepSeek-V3.2-Speciale gains up to 12% and obtains the highest scores on the benchmark.

Key Contribution

Even the best LLMs still stumble on Olympiad-level math, and retrieval quality is the bottleneck for retrieval-augmented problem solving, according to the new MathNet benchmark.

Abstract

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.
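The Math-Aware Retrieval task asks an embedding model to rank a problem's mathematically equivalent counterpart above other candidates. The abstract does not state which metric is used, so the following is only a minimal sketch that assumes cosine similarity over precomputed embeddings and a Recall@k score. The function names, the toy vectors, and the `gold` index list are hypothetical and are not taken from the MathNet release.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(query_vecs, corpus_vecs, gold, k):
    """Fraction of queries whose gold (equivalent) problem is ranked in the top k.

    gold[i] is the corpus index of the problem assumed equivalent to query i.
    """
    hits = 0
    for qi, q in enumerate(query_vecs):
        # Rank corpus indices by descending similarity to the query.
        ranked = sorted(
            range(len(corpus_vecs)),
            key=lambda j: cosine(q, corpus_vecs[j]),
            reverse=True,
        )
        if gold[qi] in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy data (hypothetical): two queries, three candidate problems.
queries = [[1.0, 0.0], [0.0, 1.0]]
corpus = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
gold = [0, 2]  # assumed equivalent problem for each query

print(recall_at_k(queries, corpus, gold, k=1))
print(recall_at_k(queries, corpus, gold, k=2))
```

In this toy setup the second query's equivalent problem is outranked by a superficially closer candidate at k=1, which is the kind of failure the abstract attributes to embedding models.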

Eval Frameworks & Benchmarks, Multimodal Models, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 1

Influential citations: 0

References: 41

Year: 2026

Venue: N/A
