BiometricsAI | Feb 17, 2026 | arXiv:2602.15531

GenAI-LA: Generative AI and Learning Analytics Workshop (LAK 2026), April 27–May 1, 2026, Bergen, Norway

Javier Irigoyen, Roberto Daza, Aythami Morales, Julian Fierrez, Francisco Jurado, Alvaro Ortigosa, Ruben Tolosana

AI Summary

The paper introduces EduEVAL-DB, a dataset of 854 explanations for 139 ScienceQA questions, comprising human-teacher explanations and LLM-generated explanations simulating various instructional styles and shortcomings. A pedagogical risk rubric with five dimensions (factual correctness, explanatory depth, relevance, appropriateness, and bias) is proposed and used to annotate the dataset with binary risk labels via expert teacher review. Experiments benchmark Gemini 2.5 Pro and Llama 3.1 8B, demonstrating the potential of EduEVAL-DB for fine-tuning smaller models for pedagogical risk detection.

Key Contribution

Fine-tuning a small Llama 3.1 model on the new EduEVAL-DB dataset allows it to rival the pedagogical risk detection capabilities of a much larger model (Gemini 2.5 Pro), suggesting a path to effective and efficient AI tutoring tools.

Abstract

This work introduces EduEVAL-DB, a dataset based on teacher roles designed to support the evaluation and training of automatic pedagogical evaluators and AI tutors for instructional explanations. The dataset comprises 854 explanations corresponding to 139 questions from a curated subset of the ScienceQA benchmark, spanning science, language, and social science across K-12 grade levels. For each question, one human-teacher explanation is provided and six are generated by LLM-simulated teacher roles. These roles are inspired by instructional styles and shortcomings observed in real educational practice and are instantiated via prompt engineering. We further propose a pedagogical risk rubric aligned with established educational standards, operationalizing five complementary risk dimensions: factual correctness, explanatory depth and completeness, focus and relevance, student-level appropriateness, and ideological bias. All explanations are annotated with binary risk labels through a semi-automatic process with expert teacher review. Finally, we present preliminary validation experiments to assess the suitability of EduEVAL-DB for evaluation. We benchmark a state-of-the-art education-oriented model (Gemini 2.5 Pro) against a lightweight local Llama 3.1 8B model and examine whether supervised fine-tuning on EduEVAL-DB supports pedagogical risk detection using models deployable on consumer hardware.
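As a rough illustration of the annotation scheme described in the abstract, one explanation with its five binary risk labels could be represented as follows. This is a minimal sketch, not the dataset's actual schema: the class name `ExplanationRecord`, the field names, and the dimension keys in `RISK_DIMENSIONS` are all hypothetical, chosen only to mirror the five rubric dimensions named above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical keys mirroring the five rubric dimensions named in the abstract.
# The released dataset may use different identifiers.
RISK_DIMENSIONS = (
    "factual_correctness",
    "explanatory_depth_completeness",
    "focus_relevance",
    "student_level_appropriateness",
    "ideological_bias",
)


@dataclass
class ExplanationRecord:
    """One explanation plus its binary risk labels (assumed layout)."""

    question_id: str   # hypothetical identifier for the ScienceQA question
    source: str        # e.g. "human_teacher" or an LLM-simulated role name
    explanation: str
    risks: Dict[str, bool] = field(default_factory=dict)

    def __post_init__(self) -> None:
        unknown = set(self.risks) - set(RISK_DIMENSIONS)
        if unknown:
            raise ValueError(f"unknown risk dimensions: {sorted(unknown)}")
        # Assumption of this sketch: a dimension that is not listed is
        # treated as "no risk flagged".
        self.risks = {d: bool(self.risks.get(d, False)) for d in RISK_DIMENSIONS}

    def flagged(self) -> List[str]:
        """Return the flagged dimensions, in rubric order."""
        return [d for d in RISK_DIMENSIONS if self.risks[d]]


record = ExplanationRecord(
    question_id="q-001",
    source="human_teacher",
    explanation="Plants make their own food using sunlight.",
    risks={"explanatory_depth_completeness": True},
)
print(record.flagged())
```

Keeping the five labels as independent booleans, rather than a single overall score, matches the abstract's description of complementary risk dimensions, each of which can be flagged on its own.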

Data Curation & Synthetic Data, Eval Frameworks & Benchmarks, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
