May 11 – May 18, 2026

RLHF & Preference Learning - Weekly Roundup

2 papers published across 0 labs.

1500% acceleration

Top Papers

May 18, 2026

Also: DeepAuto.ai

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

Stop wasting compute on irrelevant actions: targeted hindsight self-distillation focuses LLM agent training on the critical failure points, boosting performance and slashing training time.

Woongyeng Yeo, Yumin Choi, Taekyung Ki +1

RLHF & Preference Learning · Tool Use & Agents · Training Efficiency & Optimization

May 11, 2026

Unsupervised Process Reward Models

Forget expensive human annotations: this unsupervised method trains reward models that steer LLM reasoning just as well as, or even better than, their supervised counterparts.

Artyom Gadetsky, M. Kodryan, Siba Smarak Panigrahi +2

Reasoning & Chain-of-Thought · RLHF & Preference Learning
