BITApr 6, 2026arXiv:2604.04791

How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling

Yuhang Liu, Heyan Huang, Yizhe Yang, Hongyan Zhao, Zhizhuo Zeng, Yang Gao

AI Summary

This paper introduces a novel stage-wise evaluation framework for assessing LLMs on mathematical modeling problems, validated against human expert judgments. The framework reveals a significant "comprehension-execution gap" in LLMs, where performance degrades in later stages like model solving and code implementation, even with larger models. The authors attribute these failures to issues like insufficient specification, missing verification, and lack of validation, highlighting the need for improvements beyond simply scaling model size.

Key Contribution

LLMs ace the setup but fumble the execution in mathematical modeling, revealing a critical gap that scaling alone won't fix.

Abstract

Large language models (LLMs) have achieved strong performance on reasoning benchmarks, yet their ability to solve real-world problems requiring end-to-end workflows remains unclear. Mathematical modeling competitions provide a stringent testbed for evaluating such end-to-end problem-solving capability. We propose a problem-oriented, stage-wise evaluation framework that assesses LLM performance across modeling stages using expert-verified criteria. We validate the framework's reliability by comparing automatic scores with independent human expert judgments on problems from the China Postgraduate Mathematical Contest in Modeling, demonstrating substantially stronger alignment than existing evaluation schemes. Using this framework, we reveal a comprehension-execution gap in state-of-the-art LLMs: while they perform well in early stages such as problem identification and formulation, they exhibit persistent deficiencies in execution-oriented stages including model solving, code implementation, and result analysis. These gaps persist even with increased model scale. We further trace these failures to insufficient specification, missing verification, and lack of validation, with errors propagating across stages without correction. Our findings suggest that bridging this gap requires approaches beyond model scaling, offering insights for applying LLMs to complex real-world problem solving.

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling

Related Papers