May 1 – May 8, 2026

RLHF & Preference Learning - Weekly Roundup

28 papers published across 2 labs.

Top Papers

May 6, 2026

Alper Kamil Bozkurt +4 · 2w ago

Adaptive Policy Selection and Fine-Tuning under Interaction Budgets for Offline-to-Online Reinforcement Learning

Stop committing to a single policy in offline-to-online RL: adaptively select and fine-tune policies based on predicted performance to maximize returns under interaction budgets.

Alper Kamil Bozkurt, Xiaoan Xu, Shangtong Zhang +2

RLHF & Preference Learning Robotics & Embodied AI

UW · 2w ago · also SNU

Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning

Distributional regret bounds, which quantify the probability of exceeding different regret levels, are now achievable with a UCBVI-style algorithm, confirming a long-standing conjecture for multi-armed bandits.

RLHF & Preference Learning Training Efficiency & Optimization

2w ago

Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

Self-distillation can be more effective than learning from an external teacher, but only if you optimize for preference gaps instead of blindly matching the teacher's output distribution.

Xin Yu, Liuchen Liao, Yiwen Zhang +3

Inference & Quantization RLHF & Preference Learning Training Efficiency & Optimization

2w ago

EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance

GRPO's credit assignment failures—treating all tokens as equally important and misaligning step-level rewards—can be overcome with a self-supervised approach that mines the model's intrinsic information flow.

Song Yu, Li Li, Wenwen Zhao +1

Reasoning & Chain-of-Thought RLHF & Preference Learning

Hong Kong JC STEM Lab of Smart City · 2w ago · also Fudan, HKU, HUST, Lingnan University +2

Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

Finally, a way to train LLM agents to reason step-by-step without needing humans to check every intermediate thought.

Senkang Hu, Yong Dai, Xudong Han +4

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

All Papers (28)

May 6, 2026

Alper Kamil Bozkurt +4 · 2w ago

Adaptive Policy Selection and Fine-Tuning under Interaction Budgets for Offline-to-Online Reinforcement Learning

Stop committing to a single policy in offline-to-online RL: adaptively select and fine-tune policies based on predicted performance to maximize returns under interaction budgets.

Alper Kamil Bozkurt, Xiaoan Xu, Shangtong Zhang +2

RLHF & Preference Learning Robotics & Embodied AI

UW · 2w ago · also SNU

Unified Framework of Distributional Regret in Multi-Armed Bandits and Reinforcement Learning

Distributional regret bounds, which quantify the probability of exceeding different regret levels, are now achievable with a UCBVI-style algorithm, confirming a long-standing conjecture for multi-armed bandits.

RLHF & Preference Learning Training Efficiency & Optimization

2w ago

Preference-Based Self-Distillation: Beyond KL Matching via Reward Regularization

Self-distillation can be more effective than learning from an external teacher, but only if you optimize for preference gaps instead of blindly matching the teacher's output distribution.

Xin Yu, Liuchen Liao, Yiwen Zhang +3

Inference & Quantization RLHF & Preference Learning Training Efficiency & Optimization

2w ago

EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance

GRPO's credit assignment failures—treating all tokens as equally important and misaligning step-level rewards—can be overcome with a self-supervised approach that mines the model's intrinsic information flow.

Song Yu, Li Li, Wenwen Zhao +1

Reasoning & Chain-of-Thought RLHF & Preference Learning

Hong Kong JC STEM Lab of Smart City · 2w ago · also Fudan, HKU, HUST, Lingnan University +2

Self-Induced Outcome Potential: Turn-Level Credit Assignment for Agents without Verifiers

Finally, a way to train LLM agents to reason step-by-step without needing humans to check every intermediate thought.

Senkang Hu, Yong Dai, Xudong Han +4

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

2w ago

Reinforcement Learning for Compositional Generalization with Outcome-Level Optimization

RL can unlock better compositional generalization than supervised fine-tuning by directly optimizing for correct outcomes, especially on complex tasks where supervised models overfit.

Xiyan Fu, Wei Liu

Reasoning & Chain-of-Thought RLHF & Preference Learning

University of Science and Technology · 2w ago

Uncertainty-Aware Exploratory Direct Preference Optimization for Multimodal Large Language Models

MLLMs can overcome self-referential bias and improve visual grounding by actively exploring and correcting their cognitive deficiencies, guided by token-level epistemic uncertainty.

Huatian Zhang, Zhendong Mao, Lei Zhang +1

Computer Vision Multimodal Models RLHF & Preference Learning

Zhiqing Cui +13 · 2w ago

Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

LLM multi-agent systems can achieve significantly higher accuracy at a fraction of the cost by learning to selectively delegate tasks instead of relying on rigid orchestration.

Zhiqing Cui, Haotong Xie, Jiahao Yuan +11

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Ilan University · 2w ago

Modular Reinforcement Learning For Cooperative Swarms

Decomposing robot swarm state representations unlocks effective cooperation even with computationally-limited agents.

Erel Shtossel, Gal A. Kaminka

Distributed Systems & Hardware RLHF & Preference Learning Robotics & Embodied AI

2w ago

Misaligned by Reward: Socially Undesirable Preferences in LLMs

Current reward models often *prefer* socially undesirable responses, revealing a critical gap in LLM alignment beyond instruction following.

Gayane Ghazaryan, Esra Dönmez

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks RLHF & Preference Learning

Yidong He +6 · 2w ago

Strat-Reasoner: Reinforcing Strategic Reasoning of LLMs in Multi-Agent Games

LLMs can learn to play multi-agent games far better by recursively modeling the reasoning of other players, leading to a 22% performance boost.

Yidong He, Yutao Lai, Pengxu Yang +4

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Haotian Xia +6 · 2w ago · also HKU, Northwestern

StoryAlign: Evaluating and Training Reward Models for Story Generation

Current reward models are surprisingly bad at judging story quality, achieving only 66% accuracy in selecting human-preferred narratives – a gap closed by a new, purpose-built reward model.

Haotian Xia, Hao Peng, Yunjia Qi +4

Eval Frameworks & Benchmarks Natural Language Processing RLHF & Preference Learning

Miao Wang +7 · 2w ago

Reward-Decomposed Reinforcement Learning for Immersive Video Role-Playing

Forget stilted, unconvincing VR characters: EBM-RL's novel reward decomposition finally makes video-grounded role-playing dialogue feel immersive.

Miao Wang, Yuling Shi, Yijiang Li +5

Natural Language Processing RLHF & Preference Learning Tool Use & Agents

2w ago

RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

LLMs can get up to 6x more logically consistent without human feedback, simply by fusing NLI scores into the DPO training loop.

Qiming Bao, Juho Leinonen, Paul Denny +1

Natural Language Processing Reasoning & Chain-of-Thought RLHF & Preference Learning

2w ago · also AIST, Stockmark

Why Expert Alignment Is Hard: Evidence from Subjective Evaluation

Expert alignment is hard not just because of model limitations, but because human subjective evaluation is a moving target.

Tzu-Mi Lin, Wataru Hirota, Tatsuya Ishigaki +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks RLHF & Preference Learning

Lingzhe Zhang +8 · 2w ago

Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning

RFT's Achilles heel? This benchmark reveals how fragile reinforcement fine-tuning is, and introduces an automated system to catch and fix training failures before they tank your LLM.

Lingzhe Zhang, Tong Jia, Yunpeng Zhai +6

Distributed Systems & Hardware RLHF & Preference Learning Training Efficiency & Optimization

Jiaming Hu +4 · 2w ago

Towards General Preference Alignment: Diffusion Models at Nash Equilibrium

Ditch the Bradley-Terry model: a game-theoretic approach to diffusion alignment unlocks better text-to-image generation by directly optimizing for Nash equilibrium in human preferences.

Jiaming Hu, Jiamu Bai, Haoyu Wang +2

Computer Vision Multimodal Models RLHF & Preference Learning

May 5, 2026

Tianyang Han +10 · 2w ago

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

Stop rewarding reasoning that just looks good – reward reasoning that actually *helps* the downstream model solve the task.

Tianyang Han, Hengyu Shi +8

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Kazan Federal University · 2w ago · also Automation and Information Technologies, Department of Automated Systems for Data, Department of Data Analysis and Programming, Dmukhtasibovich -Doctor of Physical and Mathematical +5

Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF

Learn to build and evaluate your own NLP pipeline, from tokenization to RLHF, using open-weight models and reproducible research practices.

Mullosharaf K. Arabov

Natural Language Processing Recommendation & Information Retrieval RLHF & Preference Learning

May 4, 2026

Haixin Wang +8 · 2w ago · also HKU

T^2PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

Multi-turn RL agents can learn far more effectively by explicitly monitoring and controlling uncertainty at both the token and turn levels, leading to more stable training and higher performance.

Haixin Wang, Hejie Cui, Chenwei Zhang +6

RLHF & Preference Learning Tool Use & Agents Training Efficiency & Optimization

Chenchen Zhang · 2w ago

Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

Turns out, nobody's explicitly RL-training LLM agents when to *stop* in multi-agent systems, despite its critical role in efficiency and cost.

Chenchen Zhang

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

May 2, 2026

Google Research · 3w ago · also TAU

Hallucinations Undermine Trust; Metacognition is a Way Forward

LLMs' persistent hallucinations aren't just about lacking knowledge, but about lacking the self-awareness to know what they *don't* know, suggesting uncertainty expression is key to building trustworthy AI.

G. Yona, Mor Geva, Yossi Matias

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks RLHF & Preference Learning

May 1, 2026

Chengshuai Shi +12 · 3w ago

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

Forget short-horizon RL: Odysseus proves VLMs can master 100+ turn decision-making in complex games, outperforming state-of-the-art models by 3x.

Chengshuai Shi, Wenzhe Li, Xin Liang +10

Multimodal Models RLHF & Preference Learning Tool Use & Agents

Minchan Kwon +5 · 3w ago

Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

Red-teaming LLMs just got more robust: Stable-GFN sidesteps GFN's notorious instability, unlocking more diverse and effective attacks.

Minchan Kwon, Sunghyun Baek, Minseo Kim +3

Red-Teaming & Adversarial Robustness RLHF & Preference Learning

Minghui Chen +7 · 3w ago

Online Self-Calibration Against Hallucination in Vision-Language Models

LVLMs are better at spotting their own mistakes than generating correct answers in the first place, and this self-awareness can be exploited to reduce hallucinations.

Minghui Chen, Chenxu Yang, He Zhu +5

Computer Vision Multimodal Models RLHF & Preference Learning

Yi Wang +17 · 3w ago

Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

Generalist robot policies can achieve 95% success rates on real-world manipulation tasks by continually learning from a fleet of robots, even in the face of distribution shifts and long-tail failures.

Yi Wang, Xincheng Li, Pengwei Xie +15

Multimodal Models RLHF & Preference Learning Robotics & Embodied AI

Indraneil Paul +3 · 3w ago

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Current code reward models are myopic, mostly rewarding functional correctness, but Themis-RM learns to score code across multiple criteria and languages, opening the door to more nuanced and useful code generation.

Indraneil Paul, Glavavs Glavas, Glavaš Glavas +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks RLHF & Preference Learning

Zihan Lin +8 · 3w ago

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

LLMs can reason better and generate more diverse outputs by projecting negative samples onto a positive subspace during reinforcement learning.

Zihan Lin, Xiaohan Wang, Jie Cao +6

Reasoning & Chain-of-Thought RLHF & Preference Learning
