Peer-Predictive Self-Training (PST) is a novel label-free fine-tuning framework in which multiple language models improve collaboratively by using a cross-model aggregated response as a self-generated training signal. PST scales self-training updates based on the pointwise mutual information (PMI) between each intermediate response and the aggregate, emphasizing updates for less informative or misaligned responses. Experiments on mathematical reasoning benchmarks show that PST improves exact-match accuracy by 2.2 to 4.3 percentage points and significantly reduces the generator-verifier gap across multiple models.
Language models can bootstrap their reasoning abilities without human labels by learning from each other's aggregated answers, achieving significant gains in mathematical reasoning.
Mechanisms for continued self-improvement of language models without external supervision remain an open challenge. We propose Peer-Predictive Self-Training (PST), a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses in practice, serves as an internal target for learning. We measure how informative each intermediate response is about the aggregate using pointwise mutual information (PMI), and use this signal to scale self-training updates. Responses already aligned with the aggregate are updated less, while less informative or misaligned responses are updated more. On mathematical reasoning benchmarks (SimulEq, Math500, and MultiArith), PST improves exact-match accuracy by 2.2 to 4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent, while requiring no external supervision or teacher-student hierarchy and relying solely on cross-model interactions. These results suggest that cross-model generations and peer-predictive feedback can serve as an effective basis for self-supervised training.
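The abstract does not spell out how the PMI-based scaling is computed, but the core idea can be sketched. Below is a minimal, hypothetical illustration (not the authors' implementation): the PMI between a response and the aggregate is estimated from log-probabilities, and the per-response self-training loss weight is a decreasing function of that PMI, so responses already predictive of the aggregate receive smaller updates. The sigmoid form of `update_weight` and the `temperature` parameter are assumptions for illustration only.

```python
import math


def pmi(logp_joint: float, logp_response: float, logp_aggregate: float) -> float:
    """Pointwise mutual information: PMI(r; a) = log p(r, a) - log p(r) - log p(a).

    All arguments are log-probabilities (assumed available from the models).
    """
    return logp_joint - logp_response - logp_aggregate


def update_weight(pmi_value: float, temperature: float = 1.0) -> float:
    """Hypothetical scaling rule: sigmoid(-PMI / temperature).

    High-PMI responses (already informative about the aggregate) get weights
    near 0, i.e. small updates; misaligned responses get weights near 1.
    """
    return 1.0 / (1.0 + math.exp(pmi_value / temperature))


def weighted_self_training_loss(per_response_losses, pmi_values):
    """Scale each response's self-training loss by its PMI-derived weight."""
    return sum(
        update_weight(p) * loss
        for loss, p in zip(per_response_losses, pmi_values)
    )
```

Under this sketch, a response with PMI 0 would contribute half its loss, while a strongly aligned response (large positive PMI) would contribute almost nothing, matching the paper's description that aligned responses are updated less.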