The paper introduces EduEVAL-DB, a dataset of 854 explanations for 139 ScienceQA questions, comprising human-teacher explanations and LLM-generated explanations simulating various instructional styles and shortcomings. A pedagogical risk rubric with five dimensions (factual correctness, explanatory depth, relevance, appropriateness, and bias) is proposed and used to annotate the dataset with binary risk labels via expert teacher review. Experiments benchmark Gemini 2.5 Pro and Llama 3.1 8B, demonstrating the potential of EduEVAL-DB for fine-tuning smaller models for pedagogical risk detection.
In preliminary experiments, fine-tuning the lightweight Llama 3.1 8B model on EduEVAL-DB lets it rival the much larger Gemini 2.5 Pro at pedagogical risk detection, suggesting a path to effective and efficient AI tutoring tools.
This work introduces EduEVAL-DB, a dataset based on teacher roles designed to support the evaluation and training of automatic pedagogical evaluators and AI tutors for instructional explanations. The dataset comprises 854 explanations corresponding to 139 questions from a curated subset of the ScienceQA benchmark, spanning science, language, and social science across K-12 grade levels. For each question, one human-teacher explanation is provided and six are generated by LLM-simulated teacher roles. These roles are inspired by instructional styles and shortcomings observed in real educational practice and are instantiated via prompt engineering. We further propose a pedagogical risk rubric aligned with established educational standards, operationalizing five complementary risk dimensions: factual correctness, explanatory depth and completeness, focus and relevance, student-level appropriateness, and ideological bias. All explanations are annotated with binary risk labels through a semi-automatic process with expert teacher review. Finally, we present preliminary validation experiments to assess the suitability of EduEVAL-DB for evaluation. We benchmark a state-of-the-art education-oriented model (Gemini 2.5 Pro) against a lightweight local Llama 3.1 8B model and examine whether supervised fine-tuning on EduEVAL-DB supports pedagogical risk detection using models deployable on consumer hardware.
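To make the annotation scheme concrete, here is a minimal sketch of how one annotated explanation might be represented: a text, its source (the human teacher or a simulated role), and one binary risk label per rubric dimension. The class name, field names, dimension keys, and the role name in the usage example are all illustrative assumptions; the abstract does not describe the dataset's actual file format or schema.

```python
from dataclasses import dataclass

# The five rubric dimensions named in the abstract. The snake_case keys
# are hypothetical identifiers chosen for this sketch.
RISK_DIMENSIONS = (
    "factual_correctness",
    "explanatory_depth_completeness",
    "focus_relevance",
    "student_level_appropriateness",
    "ideological_bias",
)


@dataclass
class AnnotatedExplanation:
    """One explanation with binary risk labels (assumed schema)."""

    question_id: str
    source: str  # e.g. "human_teacher" or a simulated role name (assumed)
    text: str
    risks: dict[str, bool]  # True means a risk was flagged on that dimension

    def __post_init__(self) -> None:
        # Require exactly one label per rubric dimension.
        if set(self.risks) != set(RISK_DIMENSIONS):
            raise ValueError("risks must have exactly the five rubric dimensions")

    def flagged(self) -> list[str]:
        """Return the flagged dimensions, in rubric order."""
        return [d for d in RISK_DIMENSIONS if self.risks[d]]


def risk_rates(items: list[AnnotatedExplanation]) -> dict[str, float]:
    """Fraction of explanations flagged on each dimension."""
    if not items:
        return {d: 0.0 for d in RISK_DIMENSIONS}
    return {d: sum(e.risks[d] for e in items) / len(items) for d in RISK_DIMENSIONS}
```

A usage example with made-up records (not taken from the dataset):

```python
clean = {d: False for d in RISK_DIMENSIONS}
a = AnnotatedExplanation("q1", "human_teacher", "Plants make food from light.", clean)
b = AnnotatedExplanation("q1", "simulated_role", "Off-topic anecdote.",
                         {**clean, "focus_relevance": True})
print(b.flagged())
print(risk_rates([a, b])["focus_relevance"])
```

Keeping the five labels separate, rather than collapsing them into a single score, mirrors the rubric's framing of the dimensions as complementary.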