Cloud AI · KAIST · NAVER Labs · Jun 9, 2026 · arXiv:2606.10403

KCSAT-ML: Probing Reasoning Models with Nationwide-Cohort Human Difficulty

AI Summary

The paper introduces KCSAT-ML, a benchmark of 664 mathematics problems from the Korean College Scholastic Ability Test (2014-2025). A 339-item core set carries official per-item error rates from nationwide examinee cohorts. Using the Difficulty-aligned Reasoning Gain (DRG) metric, the authors test whether a model's mistakes concentrate on items humans found hard or on items humans found easy. Models with near-identical accuracy can sit at near-opposite DRG values: some fail mainly on the items humans also find hard, while others solve the hardest items yet fail on items humans find easy, a contrast that aggregate accuracy hides.

Key Contribution

Models with near-identical accuracy can fail on very different items: one errs where humans also struggle, while another solves the hardest items yet misses items humans find easy. Aggregate accuracy alone does not show this difference.

Abstract

Math reasoning benchmarks have proliferated, yet most lack a per-item difficulty signal grounded in actual human performance. We introduce KCSAT-ML, a decade (2014-2025) of Korean College Scholastic Ability Test (KCSAT; Suneung) mathematics: 664 problems with a 339-item core set carrying official per-item error rates from nationwide cohorts of hundreds of thousands of examinees. We pair the benchmark with Difficulty-aligned Reasoning Gain (DRG): a score-orthogonal metric that asks whether a model's mistakes concentrate on the items humans found hard, or on items humans found easy. Together they expose, across a wide range of VLMs (and LLMs via OCR), three patterns: (i) low-budget accuracy collapses on the high-human-error tail at every model size; (ii) test-time scaling (TTS) raises token use roughly linearly with cohort error rate, while accuracy gains follow a non-monotonic curve; (iii) within a single family, TTS flips between anti-scaling on the hardest items and overthinking on easier ones -- two faces of the same alignment failure. On DRG, models with near-identical accuracy can sit at near-opposite values: one model gets wrong what humans also find hard, while another solves the hardest items yet fails on items humans find easy -- a contrast that aggregate accuracy hides. Our code and dataset builder will be open-sourced at https://github.com/naver-ai/KCSAT-ML.
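The abstract does not give the DRG formula, so the sketch below is not DRG itself. It is a simple illustrative proxy for the same question: do a model's mistakes fall on items with high or low cohort error rates? The function name `difficulty_alignment`, the six error rates, and the two answer patterns are all hypothetical, and unlike DRG this proxy is not claimed to be score-orthogonal.

```python
from statistics import mean


def difficulty_alignment(human_error_rates, model_correct):
    """Illustrative proxy, NOT the paper's DRG definition.

    Returns the mean human error rate on items the model got wrong
    minus the mean human error rate on items it got right.
    Positive: the model's mistakes sit on items humans found hard.
    Negative: the model's mistakes sit on items humans found easy.
    """
    wrong = [e for e, ok in zip(human_error_rates, model_correct) if not ok]
    right = [e for e, ok in zip(human_error_rates, model_correct) if ok]
    if not wrong or not right:
        # Undefined when the model is all-correct or all-wrong; use 0.0 here.
        return 0.0
    return mean(wrong) - mean(right)


# Hypothetical cohort error rates for six items, sorted easy to hard.
human_error = [0.05, 0.10, 0.20, 0.60, 0.80, 0.90]

# Two hypothetical models with the same accuracy (4 of 6 correct).
model_a = [True, True, True, True, False, False]   # misses the hardest items
model_b = [False, False, True, True, True, True]   # misses the easiest items

acc_a = sum(model_a) / len(model_a)
acc_b = sum(model_b) / len(model_b)
score_a = difficulty_alignment(human_error, model_a)
score_b = difficulty_alignment(human_error, model_b)

print(f"model A: accuracy={acc_a:.3f}, alignment={score_a:+.4f}")
print(f"model B: accuracy={acc_b:.3f}, alignment={score_b:+.4f}")
```

Both toy models score 4 of 6, yet the proxy is positive for model A and negative for model B, which mirrors the contrast the abstract says aggregate accuracy hides.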

Eval Frameworks & Benchmarks · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
