Mar 1, 2026 · arXiv:2603.00925

The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors

Li Lucy, Albert Zhang, Nathan Anderson, Ryan Knight, Kyle Lo

AI Summary

This paper evaluates 11 vision-language models (VLMs) on DrawEduMath, a QA benchmark of real students' handwritten math work. All evaluated VLMs are less accurate when describing work from students who need more pedagogical help, and they perform worst on questions about identifying and describing student errors. The results point to a gap in VLMs' readiness for educational applications: the models are least effective on the work of the students who need the most assistance.

Key Contribution

VLMs that ace math problems still flunk at understanding *how* students go wrong, highlighting a critical gap for AI in education.

Abstract

Effective mathematics education requires identifying and responding to students' mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students' handwritten, hand-drawn responses to math problems. We find that models' weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who require more pedagogical help, and across all QA, they struggle the most on questions related to assessing student error. Thus, while VLMs may be optimized to be math problem solving experts, our results suggest that they require alternative development incentives to adequately support educational use cases.

Eval Frameworks & Benchmarks · Multimodal Models · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
