This paper introduces HG-Bench, a benchmark for evaluating answer-region grounding in multi-page handwritten homework, motivated by the need for automated assessment to localize answers and reasoning steps accurately in noisy student submissions. The benchmark consists of 500 K-12 homework samples annotated with hierarchical answer-level and step-level regions, enabling a two-level evaluation of models' grounding capabilities. Existing zero-shot systems struggle with this task, reaching at most 55.22% on complete-answer localization (FA), while a fine-tuned GLM-4.6V 9B reference model reaches 74.97%, which points to a substantial capability gap in current zero-shot approaches.
No evaluated zero-shot model exceeds 55.22% on complete-answer localization or 48.22% on step-level decomposition in multi-page handwritten homework, revealing a significant gap in automated assessment capabilities.
Automated homework assessment depends not only on recognizing student answers, but also on accurately locating where each answer and each intermediate reasoning step appears in noisy, multi-page handwritten work. This paper addresses the missing evaluation setting of page-aware, two-level answer-region grounding: given a sequence of homework page images, a model must localize complete answer regions and their ordered step-level subregions. We introduce HG-Bench, a benchmark of 500 human-annotated K-12 homework samples curated from a 1,489,278-image source pool, with question-level and step-level boxes linked by a hierarchical containment constraint. HG-Bench is paired with a page-aware evaluation protocol that separately measures complete-answer localization (FA) and step-level decomposition (FSm), revealing whether models truly ground the spatial structure of student reasoning rather than merely parse visible text. Across frontier closed-source APIs and competitive open-weight VLMs, no zero-shot system exceeds 55.22% on FA or 48.22% on FSm, while a GLM-4.6V 9B reference model fine-tuned on ~10k in-domain examples reaches 74.97% on FA and 72.26% on FSm. These results identify step-level handwritten grounding as a concrete capability gap and provide a reproducible benchmark, evaluation protocol, and trained reference point for future work on automated homework assessment.
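The hierarchical containment constraint mentioned in the abstract can be illustrated with a small sketch. The abstract does not specify the annotation schema, so everything here is an assumption made for illustration: the `Box` and `AnswerAnnotation` types, the choice that each box carries a page index, and the rule that a step box must lie inside some answer box on the same page. It is not the benchmark's actual format, and it does not implement the FA or FSm metrics.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned box on one page (hypothetical schema, not HG-Bench's)."""
    page: int   # index of the page image within the submission
    x1: float
    y1: float
    x2: float
    y2: float


def contains(outer: Box, inner: Box, tol: float = 0.0) -> bool:
    """True if `inner` lies within `outer` on the same page, up to `tol` pixels."""
    return (
        outer.page == inner.page
        and inner.x1 >= outer.x1 - tol
        and inner.y1 >= outer.y1 - tol
        and inner.x2 <= outer.x2 + tol
        and inner.y2 <= outer.y2 + tol
    )


@dataclass
class AnswerAnnotation:
    """One question's answer region(s) plus its ordered step regions."""
    answer_boxes: List[Box]                      # assumed: one box per page the answer touches
    step_boxes: List[Box] = field(default_factory=list)  # assumed: listed in reading order


def containment_violations(ann: AnswerAnnotation, tol: float = 0.0) -> List[int]:
    """Return indices of step boxes not contained in any answer box."""
    return [
        i
        for i, step in enumerate(ann.step_boxes)
        if not any(contains(answer, step, tol) for answer in ann.answer_boxes)
    ]


# Example: an answer spanning pages 0 and 1 with three steps.
example = AnswerAnnotation(
    answer_boxes=[Box(0, 10, 10, 200, 300), Box(1, 10, 10, 200, 100)],
    step_boxes=[
        Box(0, 20, 20, 180, 100),   # inside the page-0 answer box
        Box(1, 20, 20, 180, 90),    # inside the page-1 answer box
        Box(1, 20, 20, 180, 150),   # extends below the page-1 answer box
    ],
)
violations = containment_violations(example)
```

A check like this could serve as a sanity filter on annotations or model outputs before scoring. Making the page index part of the box means a step predicted on the wrong page never counts as contained, which is one plausible reading of "page-aware".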