Mar 30, 2026 · arXiv:2603.28537

Training data generation for context-dependent rubric-based short answer grading

Pavel Šindelář, Dávid Slivka, Christopher Bouma, Filip Prášil, Ondřej Bojar

AI Summary

This paper investigates methods for generating synthetic training data for automatic short answer grading, specifically for the PISA test. It explores techniques that use a small, confidential dataset as a reference to create larger surrogate datasets, preserving confidentiality by working through very simple derived text formats. The generated datasets are at least superficially more similar to the reference data than purely prompt-generated data, and early experiments suggest that one approach may also improve model training.

Key Contribution

Synthetic training data generated from a small confidential reference dataset can be at least superficially more similar to the reference data than purely prompt-generated data, and early experiments suggest that one such approach may improve model training for short answer grading.

Abstract

Every 4 years, the PISA test is administered by the OECD to test the knowledge of teenage students worldwide and allow for comparisons of educational systems. However, having to avoid language differences and annotator bias makes the grading of student answers challenging. For these reasons, it would be interesting to compare methods of automatic student answer grading. To train some of these methods, which require machine learning, or to compute parameters or select hyperparameters for those that do not, a large amount of domain-specific data is needed. In this work, we explore a small number of methods for creating a large-scale training dataset using only a relatively small confidential dataset as a reference, leveraging a set of very simple derived text formats to preserve confidentiality. Using these methods, we successfully created three surrogate datasets that are, at the very least, superficially more similar to the reference dataset than purely the result of prompt-based generation. Early experiments suggest one of these approaches might also lead to improved model training.

Data Curation & Synthetic Data · Eval Frameworks & Benchmarks · Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 5

Year: 2026

Venue: N/A
