Apr 20, 2026

arXiv:2604.18574

When Can LLMs Learn to Reason with Weak Supervision?

Salman Rahman, Jingyan Shen, Anna Mordvina, Hamid Palangi, Saadia Gabriel, Pavel Izmailov

AI Summary

This study investigates when large language models (LLMs) can learn to reason under weak supervision, covering three settings: scarce data, noisy rewards, and self-supervised proxy rewards. The authors find that generalization is tied to training reward saturation dynamics: models that sustain a prolonged pre-saturation phase learn effectively, while models that saturate quickly tend to memorize. They identify reasoning faithfulness as the pre-RL property that predicts which regime a model falls into, and show that supervised fine-tuning on explicit reasoning traces is necessary for generalization, with continual pre-training on domain data amplifying the effect. Applied together, these interventions enable Llama3.2-3B-Base to generalize in all three weak supervision settings, where the base model previously failed.

Key Contribution

Generalization in LLMs hinges on training reward saturation dynamics, with reasoning faithfulness emerging as a critical predictor of success under weak supervision.

Abstract

Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.
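The noisy-reward setting and the notion of reward saturation described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the label-flip noise model, and the `threshold` and `patience` values are all hypothetical choices made here for illustration, since the abstract does not specify how noise is injected or how saturation is measured.

```python
import random


def noisy_reward(is_correct: bool, flip_prob: float, rng: random.Random) -> float:
    """Binary verifiable reward, flipped with probability flip_prob.

    Assumption: noise is modeled as a symmetric label flip. The paper
    may define noisy rewards differently.
    """
    reward = 1.0 if is_correct else 0.0
    if rng.random() < flip_prob:
        reward = 1.0 - reward
    return reward


def saturation_step(train_rewards, threshold=0.95, patience=3):
    """Return the first step of a run of `patience` consecutive steps whose
    mean training reward is at or above `threshold`, or None if no such
    run exists.

    Assumption: `threshold` and `patience` are illustrative defaults,
    not values taken from the paper.
    """
    run = 0
    for step, reward in enumerate(train_rewards):
        run = run + 1 if reward >= threshold else 0
        if run == patience:
            return step - patience + 1
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    # Reward for a correct answer under a hypothetical 20% flip rate.
    print(noisy_reward(True, flip_prob=0.2, rng=rng))
    # A hypothetical training-reward curve, one mean reward per step.
    curve = [0.2, 0.5, 0.96, 0.97, 0.99, 1.0]
    print(saturation_step(curve))
```

Under these assumptions, a late saturation step would correspond to the prolonged pre-saturation phase the abstract associates with generalization, while an early one would correspond to rapid saturation. Whether downstream performance climbs during that phase still has to be checked separately on held-out tasks.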

Data Curation & Synthetic Data, Reasoning & Chain-of-Thought, RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
