April 24 – May 1, 2026

RLHF & Preference Learning - Weekly Roundup

46 papers published across 5 labs.

Selected Labs publishing this week

Tsinghua AI (2), NVIDIA (1), NUS (1), Mila (1), Stanford HAI (1)

Top Papers

Apr 30, 2026

Hanzhong Guo +10·3w ago·also ByteDance

Leveraging Verifier-Based Reinforcement Learning in Image Editing

Image editing gets a reasoning upgrade: a chain-of-thought verifier model beats powerful VLMs at judging edits and boosts editing model performance.

Hanzhong Guo, Jie Wu, Jie Wu +8

Computer Vision Multimodal Models RLHF & Preference Learning

Apr 27, 2026

3w ago

Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

Compressing multi-dimensional human preferences into single binary labels severely degrades DPO training, but a semi-supervised approach can recover state-of-the-art performance without additional human annotation.

Xinxing Liu, Xinxin Liu, Ming Li +3

Computer Vision RLHF & Preference Learning Training Efficiency & Optimization

3w ago·also ByteDance

ViPO: Visual Preference Optimization at Scale

Scaling visual preference optimization hinges on data quality, as a massive, high-resolution dataset renders complex optimization algorithms unnecessary.

Ming Li, Jie Wu, J. Cui +4

Computer Vision Multimodal Models RLHF & Preference Learning

May 1, 2026

Chengshuai Shi +12·3w ago

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

Forget short-horizon RL: Odysseus proves VLMs can master 100+ turn decision-making in complex games, outperforming state-of-the-art models by 3x.

Chengshuai Shi, Wenzhe Li, Xin Liang +10

Multimodal Models RLHF & Preference Learning Tool Use & Agents

Minchan Kwon +5·3w ago

Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

Red-teaming LLMs just got more robust: Stable-GFN sidesteps GFN's notorious instability, unlocking more diverse and effective attacks.

Minchan Kwon, Sunghyun Baek, Minseo Kim +3

Red-Teaming & Adversarial Robustness RLHF & Preference Learning

All Papers (46)

May 1, 2026

Chengshuai Shi +12·3w ago

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

Forget short-horizon RL: Odysseus proves VLMs can master 100+ turn decision-making in complex games, outperforming state-of-the-art models by 3x.

Chengshuai Shi, Wenzhe Li, Xin Liang +10

Multimodal Models RLHF & Preference Learning Tool Use & Agents

Minchan Kwon +5·3w ago

Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

Red-teaming LLMs just got more robust: Stable-GFN sidesteps GFN's notorious instability, unlocking more diverse and effective attacks.

Minchan Kwon, Sunghyun Baek, Minseo Kim +3

Red-Teaming & Adversarial Robustness RLHF & Preference Learning

Minghui Chen +7·3w ago

Online Self-Calibration Against Hallucination in Vision-Language Models

LVLMs are better at spotting their own mistakes than generating correct answers in the first place, and this self-awareness can be exploited to reduce hallucinations.

Minghui Chen, Chenxu Yang, Hengjie Zhu +5

Computer Vision Multimodal Models RLHF & Preference Learning

Yi Wang +17·3w ago

Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

Generalist robot policies can achieve 95% success rates on real-world manipulation tasks by continually learning from a fleet of robots, even in the face of distribution shifts and long-tail failures.

Yi Wang, Xincheng Li, Pengwei Xie +15

Multimodal Models RLHF & Preference Learning Robotics & Embodied AI

Indraneil Paul +3·3w ago

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Current code reward models are myopic, mostly rewarding functional correctness, but Themis-RM learns to score code across multiple criteria and languages, opening the door to more nuanced and useful code generation.

Indraneil Paul, Glavavs Glavas, Glavaš Glavas +1

Code Generation & Program Synthesis Eval Frameworks & Benchmarks RLHF & Preference Learning

Zihan Lin +8·3w ago

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

LLMs can reason better and generate more diverse outputs by projecting negative samples onto a positive subspace during reinforcement learning.

Zihan Lin, Xiaohan Wang, Jie Cao +6

Reasoning & Chain-of-Thought RLHF & Preference Learning

Apr 30, 2026

Tsinghua AI·3w ago

Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

Kernel smoothing, a classic technique from nonparametric statistics, can make reinforcement learning with LLMs more sample efficient.

Shijin Gong, Kai Ye, Jin Zhu +1

Reasoning & Chain-of-Thought RLHF & Preference Learning Training Efficiency & Optimization

Hongliang Liu +2·3w ago

Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs

Forget complex training schemes – pinpointing and tweaking just 20 neurons can flip an LLM from sycophantic to truthful, thanks to a new "perturbation probing" technique.

Hongliang Liu, Tung-Ling Li, Yuhao Wu

Interpretability & Mechanistic Interp RLHF & Preference Learning

3w ago·also HKUST, SJTU, The Hong Kong University

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

Forget tedious, brittle automation scripts: RL-powered GUI agents are showing signs of "System 2" reasoning without explicit supervision, hinting at a future of truly intelligent digital inhabitants.

Junan Hu, Jian Liu, Jin-Shei Lai +7

Computer Vision RLHF & Preference Learning Tool Use & Agents

3w ago

Political Bias Audits of LLMs Capture Sycophancy to the Inferred Auditor

LLM political bias isn't a fixed ideology, but a chameleon-like response profile that bends to the perceived political leanings of the person asking the questions.

Petter Törnberg, Petter Tornberg, M. Schimmel +1

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks RLHF & Preference Learning

NTT Human Informatics Laboratories·3w ago

Debiasing Reward Models via Causally Motivated Inference-Time Intervention

Forget scaling laws: surgically debiasing reward models by intervening on just 2% of neurons lets smaller models punch *way* above their weight in alignment.

Kazutoshi Shinoda, Kosuke Nishida, Kyosuke Nishida

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp RLHF & Preference Learning

Tsinghua AI·3w ago

Beyond SFT-to-RL: Pre-alignment via Black-Box On-Policy Distillation for Multimodal RL

Stop letting SFT ruin your LMMs: PRISM uses on-policy distillation to realign your model *before* RL, boosting performance by up to 6%.

Sudong Wang, Weiquan Huang, Xiaomin Yu +10

Multimodal Models RLHF & Preference Learning Robotics & Embodied AI +1

Hanzhong Guo +10·3w ago·also ByteDance

Leveraging Verifier-Based Reinforcement Learning in Image Editing

Image editing gets a reasoning upgrade: a chain-of-thought verifier model beats powerful VLMs at judging edits and boosts editing model performance.

Hanzhong Guo, Jie Wu, Jie Wu +8

Computer Vision Multimodal Models RLHF & Preference Learning

Eyon Jang +17·3w ago

Exploration Hacking: Can LLMs Learn to Resist RL Training?

LLMs can learn to strategically sabotage their own reinforcement learning, resisting capability elicitation while maintaining task performance.

Eyon Jang, Eyon Jang, Damon Falck +15

Red-Teaming & Adversarial Robustness RLHF & Preference Learning Scalable Oversight & Alignment Theory

Jingcheng Deng +6·3w ago

Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

Latent reasoning can now outperform explicit reasoning in complex tasks, thanks to a new RL method that stabilizes training by explicitly handling issues like invalid latent states and misaligned token-level updates.

Jingcheng Deng, Zihao Wei, Liang Pang +4

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Florian Wolf +5·3w ago

Global Optimality for Constrained Exploration via Penalty Regularization

Finally, a reinforcement learning algorithm, PGP, can provably find near-optimal policies that respect safety and resource constraints, even when the policy space is non-convex.

Florian Wolf, F. Wolf, Ilyas Fatkhullin +3

RLHF & Preference Learning Robotics & Embodied AI World Models & Planning

Prabhjot Singh +6·3w ago

Learning from Disagreement: Clinician Overrides as Implicit Preference Signals for Clinical AI in Value-Based Care

Clinician overrides of AI recommendations, often seen as failures, can actually be a goldmine of preference data for training better clinical AI, especially in value-based care settings.

Prabhjot Singh, Abhishek Gupta, Chris Betz +4

Constitutional AI & AI Ethics RLHF & Preference Learning

Mehryar Mohri +1·3w ago

Mind the Gap: Structure-Aware Consistency in Preference Learning

Standard preference learning objectives like DPO are provably inconsistent, but a structure-aware margin can restore generalization guarantees.

Mehryar Mohri, Yutao Zhong

RLHF & Preference Learning Scalable Oversight & Alignment Theory

Feiyu Wu +7·3w ago·also Beijing University of Posts, Xidian

RHyVE: Competence-Aware Verification and Phase-Aware Deployment for LLM-Generated Reward Hypotheses

LLM-generated rewards in RL can be misleading early in training, but RHyVE dynamically selects the best reward signal based on policy competence, leading to improved performance.

Feiyu Wu, Xuhui Zheng, Xu Zheng +5

RLHF & Preference Learning Scalable Oversight & Alignment Theory

Beijing·3w ago·also Shanghai

Rethinking Agentic Reinforcement Learning In Large Language Models

LLMs are poised to revolutionize reinforcement learning by enabling agents with cognitive-like capabilities such as meta-reasoning and self-reflection.

Fangming Cui, Ruixiao Zhu, Chen Fang +3

RLHF & Preference Learning Tool Use & Agents World Models & Planning

Qingyu Ren +3·3w ago·also Fudan

From Coarse to Fine: Benchmarking and Reward Modeling for Writing-Centric Generation Tasks

Fine-grained reward modeling, achieved by selectively dropping instruction requirements, unlocks substantial improvements in writing-centric generation tasks.

Qingyu Ren, Tian Pan, Tianjun Pan +1

Eval Frameworks & Benchmarks Natural Language Processing RLHF & Preference Learning

Md. Faizul Ibne Amin +5·3w ago

LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

LLM judges in human-AI coding collaborations show surprisingly low inter-rater reliability, suggesting current evaluation methods may be inadequate for assessing true co-creation effectiveness.

Md. Faizul Ibne Amin, Yutaka Watanobe, Daniel M. Muepu +3

Code Generation & Program Synthesis Eval Frameworks & Benchmarks RLHF & Preference Learning +1

Apr 29, 2026

Tianhao Hu +16·3w ago

DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training

Asynchronous RL for LLMs doesn't have to sacrifice convergence for speed: DORA achieves 2-4x faster training by cleverly managing multiple policy versions during rollout.

Tianhao Hu, Xiangcheng Liu, Youshao Xiao +14

Distributed Systems & Hardware RLHF & Preference Learning Training Efficiency & Optimization

NVIDIA·3w ago

Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding

Speculative decoding, typically used post-RL, can be integrated directly into RL training loops to accelerate LLM rollout generation by up to 2.5x.

Hayate Iso, Tiyasa Mitra, Sudipta Mondal +22

Distributed Systems & Hardware Inference & Quantization RLHF & Preference Learning +1

3w ago·also SDU

Dynamic Adversarial Fine-Tuning Reorganizes Refusal Geometry

Safety training doesn't just make models refuse more, it fundamentally *reorganizes* where and how those refusals happen inside the network.

Wenhao Lan, Shan Li, Junbin Yang +2

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness RLHF & Preference Learning

Apr 28, 2026

NUS·3w ago

DGLight: DQN-Guided GRPO Fine-Tuning of Large Language Models for Traffic Signal Control

LLMs can learn effective traffic signal control policies by distilling knowledge from a DQN critic, achieving strong performance and interpretability without relying solely on sparse environmental rewards.

Natural Language Processing RLHF & Preference Learning Tool Use & Agents

Yuxin Zhang +21·3w ago

Step-Audio-R1.5 Technical Report

RLVR, the dominant training paradigm for audio language models, may be turning them into unfeeling "answering machines" that excel on benchmarks but fail the vibe check.

Yuxin Zhang, Xiangyu Zhang, Xiangyu Tony Zhang +19

Multimodal Models Reasoning & Chain-of-Thought RLHF & Preference Learning +1

3w ago

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

Imperfect rewards can actually *help* policy gradient methods escape local optima, challenging the conventional wisdom that reward accuracy is always paramount.

Shuning Shang, Hubert Strauss, Stanley Wei +2

Natural Language Processing RLHF & Preference Learning Training Efficiency & Optimization

Yeeun Lim +2·3w ago

Safe-Support Q-Learning: Learning without Unsafe Exploration

Guaranteeing zero unsafe state visits during RL training is now possible, opening the door to deploying RL agents in previously inaccessible high-risk environments.

Yeeun Lim, Narim Jeong, Donghwan Lee

Constitutional AI & AI Ethics RLHF & Preference Learning Robotics & Embodied AI

James Pustejovsky +1·3w ago

Frictive Policy Optimization for LLMs: Epistemic Intervention, Risk-Sensitive Control, and Reflective Alignment

LLMs can be aligned not just by what they say, but by *how* and *when* they intervene in a conversation to manage epistemic risk.

James Pustejovsky, Nikhil Krishnaswamy

Constitutional AI & AI Ethics RLHF & Preference Learning Scalable Oversight & Alignment Theory

3w ago

Three Models of RLHF Annotation: Extension, Evidence, and Authority

RLHF pipelines are implicitly built on shaky foundations, conflating three distinct roles for human annotators (extenders, witnesses, and representatives) in ways that undermine alignment.

Steve Coyne

Constitutional AI & AI Ethics RLHF & Preference Learning

University of Isfahan·3w ago·also University of Windsor

Backtranslation Augmented Direct Preference Optimization for Neural Machine Translation

DPO-based post-training can significantly boost the translation quality of pre-trained NMT models like gemma3-1b, even without additional parallel data.

Mehrdad Ghassabi, Spehr Rajabi, Hamidreza Baradaran Kashani +2

Data Curation & Synthetic Data Natural Language Processing RLHF & Preference Learning

Xinjie Chen +5·3w ago·also Xiamen University

JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

Ditching human labels doesn't have to mean sacrificing RLVR performance: JURY-RL uses formal verification to achieve label-free training that rivals supervised learning in mathematical reasoning and generalizes better.

Xinjie Chen, Biao Fu, Jing Wu +3

Reasoning & Chain-of-Thought RLHF & Preference Learning Scalable Oversight & Alignment Theory

3w ago·also SEU, ZJU

One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement

Forget fine-tuning every LLM: ReQueR trains a single, RL-powered query refiner that coaxes hidden reasoning abilities out of diverse, frozen models at inference time.

Dongzhou Cheng, Zhiliang Wu, Yi Yang +1

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Mila·3w ago·also BJTU

CroSearch-R1: Better Leveraging Cross-lingual Knowledge for Retrieval-Augmented Generation

CroSearch-R1 reveals that integrating cross-lingual knowledge through a dynamic retrieval strategy can substantially enhance the performance of Retrieval-Augmented Generation systems.

Ruizhen Qi, Fengran Mo, Sijin Lu +3

Natural Language Processing Recommendation & Information Retrieval RLHF & Preference Learning

3w ago

Egocentric Tactile and Proximity Sensors as Observation Priors for Humanoid Collision Avoidance

Carson Kohlbrenner, Niraj Pudasaini, W. Xie +5

RLHF & Preference Learning Robotics & Embodied AI

Ruo-Tong Chen +9·3w ago

How Can Reinforcement Learning Achieve Expert-level Placement?

Forget hand-crafting reward functions: this RL approach learns directly from expert chip layouts, unlocking expert-level placement performance with surprisingly little data.

Ruo-Tong Chen, Ke Xue, Chengrui Gao +7

RLHF & Preference Learning Robotics & Embodied AI Training Efficiency & Optimization

Apr 27, 2026

3w ago

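Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

Compressing multi-dimensional human preferences into single binary labels severely degrades DPO training, but a semi-supervised approach can recover state-of-the-art performance without additional human annotation.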

Xinxing Liu, Xinxin Liu, Ming Li +3

Computer Vision RLHF & Preference Learning Training Efficiency & Optimization

3w ago·also ByteDance

ViPO: Visual Preference Optimization at Scale

Scaling visual preference optimization hinges on data quality, as a massive, high-resolution dataset renders complex optimization algorithms unnecessary.

Ming Li, Jie Wu, J. Cui +4

Computer Vision Multimodal Models RLHF & Preference Learning

Joseph Lazzaro +3·3w ago

A Finite Time Analysis of Thompson Sampling for Bayesian Optimization with Preferential Feedback

Thompson Sampling can be just as efficient with pairwise preference feedback as it is with scalar rewards, opening up new avenues for optimization in human-in-the-loop and experimental design scenarios.

Joseph Lazzaro, Davide Buffelli, Daren Shiu +1

RLHF & Preference Learning Scientific Discovery & Drug Design

3w ago

Compute Aligned Training: Optimizing for Test Time Inference

Training LLMs to explicitly optimize for how they're *actually* used at inference time unlocks substantial performance gains compared to standard fine-tuning.

Adam Ousherovitch, Ambuj Tewari

Inference & Quantization RLHF & Preference Learning Training Efficiency & Optimization

3w ago·also DFKI

Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models

RL's superior generalization isn't about brute force, but about carefully sculpting a few key features while preserving the base model's knowledge, unlike SFT's rapid specialization.

Dan Shi, S. Ostermann, Renren Jin +2

Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought RLHF & Preference Learning

Stan Loosmore·3w ago

Leverage Laws: A Per-Task Framework for Human-Agent Collaboration

Quantifying the efficiency of human-AI collaboration boils down to balancing the agent's work output against the human's time investment in task specification, interruptions, and review.

Stan Loosmore

RLHF & Preference Learning Tool Use & Agents

Xinhe Wang +2·3w ago

Jailbreaking Frontier Foundation Models Through Intention Deception

Even frontier models like GPT-5 and Claude are highly susceptible to multi-turn jailbreaks that exploit their reliance on inferred user intent, and can even leak harmful information indirectly through "para-jailbreaking."

Xinhe Wang, Katia Sycara, Yaqi Xie

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness RLHF & Preference Learning

3w ago

Improving Vision-language Models with Perception-centric Process Reward Models

VLMs can be taught to self-correct hallucinations at the token level, leading to substantial gains in reasoning accuracy across diverse benchmarks.

Yingqian Min, Kun Zhou, Yifan Li +6

Multimodal Models Reasoning & Chain-of-Thought RLHF & Preference Learning

Apr 25, 2026

Stanford HAI·Apr 25, 2026

V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

ELBO-based reinforcement learning, previously dismissed for visual generation, can actually outperform MDP-based methods for aligning denoising generative models with human preferences.

Bingda Tang, Yuhui Zhang, Xiaohan Wang +4

RLHF & Preference Learning Training Efficiency & Optimization