Northwestern | May 28, 2026 | arXiv:2605.29582

PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning

Qikai Chang, Zhenrong Zhang, Linbo Chen, Pengfei Hu, Jianshu Zhang, Youhui Guo, Jun Du

AI Summary

PEARL is a reinforcement learning framework for training LLMs as Socratic tutors, addressing challenges in student simulation, reward modeling, and multi-objective optimization. It uses a controllable student simulator with decoupled cognitive states, a generative reward model for pedagogical quality and correctness, and a stable multi-objective RL scheme with discretized rewards and normalized advantages. Experiments show PEARL achieves state-of-the-art performance among open-source models and is competitive with proprietary LLMs, despite using a smaller 30B policy model.

Key Contribution

Socratic tutors can be effectively trained via RL by decoupling student cognitive states, using generative pedagogical rewards, and stabilizing multi-objective optimization.

Abstract

Large Language Models (LLMs) have shown promise as educational tutors, yet effective tutoring requires more than solving problems: it must provide progressive Socratic guidance and balance multiple pedagogical objectives across multi-turn interactions. However, training such tutors remains challenging due to limited-fidelity and weakly controllable student simulation, under-specified pedagogical reward modeling, and unstable multi-objective optimization. To overcome these limitations, we propose PEARL, a pedagogically aligned reinforcement learning framework for training Socratic tutoring agents, consisting of three key components. First, we introduce a controllable student simulator that decouples latent cognitive states from response generation to model diverse abilities and misconceptions. Second, we develop a generative reward model that jointly evaluates pedagogical quality and objective correctness for policy optimization. Finally, we propose a stable multi-objective RL scheme that discretizes rewards within each dimension and aggregates normalized advantages across dimensions, preventing high-variance objectives from dominating updates. Experiments on multiple benchmarks show that PEARL achieves the best performance among open-source models and remains competitive with leading proprietary LLMs, despite using only a 30B policy model.
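
The third component, discretizing rewards within each dimension and aggregating normalized advantages across dimensions, can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the function names, the three-level discretization, the group-relative normalization (mean and standard deviation computed over a group of rollouts for the same prompt), and the equal weighting of dimensions are all hypothetical and should not be read as the authors' implementation.

```python
from statistics import mean, pstdev


def discretize(value, num_levels=3, low=0.0, high=1.0):
    """Snap a raw score in [low, high] to one of num_levels evenly spaced levels."""
    clipped = min(max(value, low), high)
    step = (high - low) / (num_levels - 1)
    return low + round((clipped - low) / step) * step


def normalized_advantages(rewards, eps=1e-6):
    """Center and scale one reward dimension across a group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def aggregate_advantages(reward_vectors, ranges, num_levels=3, weights=None):
    """Discretize and normalize each dimension separately, then sum.

    reward_vectors: one tuple of raw scores per rollout (assumed layout).
    ranges: (low, high) bounds for each reward dimension (assumed known).
    """
    num_dims = len(ranges)
    weights = weights or [1.0 / num_dims] * num_dims
    total = [0.0] * len(reward_vectors)
    for d, (low, high) in enumerate(ranges):
        column = [discretize(v[d], num_levels, low, high) for v in reward_vectors]
        # Normalizing per dimension puts every objective on the same scale
        # before the weighted sum, regardless of its raw range or variance.
        for i, adv in enumerate(normalized_advantages(column)):
            total[i] += weights[d] * adv
    return total


# Hypothetical group of four rollouts scored on two dimensions:
# pedagogical quality on a 0-10 scale and correctness on a 0-1 scale.
group = [(9.1, 0.0), (2.3, 1.0), (8.7, 1.0), (1.9, 0.0)]
ranges = [(0.0, 10.0), (0.0, 1.0)]
advantages = aggregate_advantages(group, ranges)
```

In this toy group, the pedagogy score has a raw range ten times wider than the correctness score, yet after per-dimension normalization both contribute equally: the third rollout, which scores well on both, receives the largest combined advantage, while rollouts that do well on only one dimension end up near zero. Summing raw rewards first and normalizing afterwards would instead let the wider-ranging dimension dominate.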

Natural Language Processing | RLHF & Preference Learning | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 39

Year: 2026

Venue: N/A
