ECNU | Jun 24, 2026 | arXiv:2606.25491

HG-Bench: A Benchmark for Multi-Page Handwritten Answer-Region Grounding in Automated Homework Assessment

Chuangxin Zhao, Boyan Shi, Yanling Wang, Yijian LU, Canran Xiao, Jiali Chen, Jun Xia, Yan Wang, Ji Qi, Juanzi Li

AI Summary

This paper introduces HG-Bench, a benchmark for evaluating multi-page handwritten answer-region grounding in automated homework assessment, where a model must localize both answers and intermediate reasoning steps in noisy student submissions. The benchmark consists of 500 K-12 homework samples annotated with hierarchical answer-level and step-level regions, enabling a two-level evaluation of grounding capability. Zero-shot systems reach at most 55.22% on complete-answer localization (FA) and 48.22% on step-level decomposition (FSm), while a GLM-4.6V 9B model fine-tuned on ~10k in-domain examples reaches 74.97% and 72.26%, respectively.

Key Contribution

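No zero-shot system evaluated, whether a closed-source API or an open-weight VLM, exceeds 55.22% on complete-answer localization (FA) or 48.22% on step-level decomposition (FSm) for multi-page handwritten homework, which identifies step-level handwritten grounding as a concrete capability gap in automated assessment.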

Abstract

Automated homework assessment depends not only on recognizing student answers, but also on accurately locating where each answer and each intermediate reasoning step appears in noisy, multi-page handwritten work. This paper addresses the missing evaluation setting of page-aware, two-level answer-region grounding: given a sequence of homework page images, a model must localize complete answer regions and their ordered step-level subregions. We introduce HG-Bench, a benchmark of 500 human-annotated K-12 homework samples curated from a 1,489,278-image source pool, with question-level and step-level boxes linked by a hierarchical containment constraint. HG-Bench is paired with a page-aware evaluation protocol that separately measures complete-answer localization (FA) and step-level decomposition (FSm), revealing whether models truly ground the spatial structure of student reasoning rather than merely parse visible text. Across frontier closed-source APIs and competitive open-weight VLMs, no zero-shot system exceeds 55.22% on FA or 48.22% on FSm, while a GLM-4.6V 9B reference model fine-tuned on ~10k in-domain examples reaches 74.97/72.26. These results identify step-level handwritten grounding as a concrete capability gap and provide a reproducible benchmark, evaluation protocol, and trained reference point for future work on automated homework assessment.
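To make the hierarchical containment constraint concrete, the following is a minimal sketch of one way to represent question-level and step-level boxes and check that each step box lies inside an answer box on the same page. The names `Box`, `AnswerAnnotation`, and `containment_violations`, the pixel-coordinate convention, the tolerance parameter, and the choice to allow one answer box per page are all assumptions made for illustration; they are not taken from the HG-Bench annotation format or evaluation code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned box on one page (hypothetical schema, not the HG-Bench format)."""
    page: int   # 0-based index into the page sequence (assumed convention)
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Box", tol: float = 0.0) -> bool:
        """True if `other` is on the same page and inside this box, within `tol` units."""
        return (
            self.page == other.page
            and other.x0 >= self.x0 - tol
            and other.y0 >= self.y0 - tol
            and other.x1 <= self.x1 + tol
            and other.y1 <= self.y1 + tol
        )


@dataclass
class AnswerAnnotation:
    """One question's answer region(s) plus its ordered step-level subregions."""
    question_id: str
    answer_boxes: List[Box]  # assumption: an answer spanning pages has one box per page
    step_boxes: List[Box] = field(default_factory=list)  # ordered by reasoning step


def containment_violations(ann: AnswerAnnotation, tol: float = 0.0) -> List[int]:
    """Return indices of step boxes that are not contained in any answer box."""
    return [
        i
        for i, step in enumerate(ann.step_boxes)
        if not any(answer.contains(step, tol) for answer in ann.answer_boxes)
    ]


example = AnswerAnnotation(
    question_id="q1",
    answer_boxes=[Box(0, 10, 10, 200, 300), Box(1, 10, 10, 200, 100)],
    step_boxes=[
        Box(0, 20, 20, 180, 60),    # inside the page-0 answer box
        Box(1, 20, 20, 180, 60),    # inside the page-1 answer box
        Box(1, 20, 90, 180, 140),   # extends below the page-1 answer box
    ],
)
violations = containment_violations(example)
```

In this sketch the page index is part of the containment test, so a step box can never be matched to an answer box on a different page, which is one plausible reading of "page-aware". A nonzero `tol` is a convenience for tolerating small annotation jitter and is likewise an assumption.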

Computer Vision, Eval Frameworks & Benchmarks, Multimodal Models

Year: 2026

Venue: N/A
