KMP-Bench is a new benchmark that assesses LLMs' pedagogical capabilities in K-8 mathematics through two modules: KMP-Dialogue, which evaluates holistic teaching against six core pedagogical principles, and KMP-Skills, which assesses granular tutoring abilities such as multi-turn problem-solving, error detection and correction, and problem generation. Evaluations show that leading LLMs perform well on tasks with verifiable solutions but struggle to apply nuanced pedagogical principles. Fine-tuning models on KMP-Pile, a 150K dialogue dataset, substantially improves performance on KMP-Bench, highlighting the importance of pedagogically rich training data.
LLMs can ace math problems but still flunk as tutors: they struggle with nuanced pedagogical principles such as giving effective feedback.
Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically-rich training data for developing more effective AI math tutors.