29 papers published across 4 labs.
Forget hand-crafted rewards: MotionVL uses VLMs and LLMs to automatically generate task-aligned reward functions for humanoid robot RL, leading to more human-like and robust motion.
Training LLMs with objectives that put the final output in tension with the reasoning process can significantly degrade the monitorability of Chain-of-Thought, making oversight more difficult.
Robots get a 33% speed boost and become significantly more adaptable when you let LLMs handle the reasoning and RL handle the movements.
Learned approval in MONA can eliminate reward hacking, but at the cost of significantly under-optimizing for the intended task, revealing a critical trade-off in safe RL.
Stop rewarding all LLM-generated candidates equally: ShapE-GRPO uses Shapley values to fairly distribute credit within sets, leading to better training and faster convergence.
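The Shapley-value credit assignment named in the ShapE-GRPO blurb can be sketched generically: each candidate's credit is its marginal contribution to the set's reward, averaged over all orderings. The toy set reward below is a hypothetical stand-in, not the paper's actual objective.

```python
import itertools
import math

def shapley_values(items, value_fn):
    """Exact Shapley values by averaging each item's marginal
    contribution over all orderings (feasible only for small sets)."""
    n = len(items)
    credit = {item: 0.0 for item in items}
    for perm in itertools.permutations(items):
        coalition = []
        prev = value_fn(coalition)
        for item in perm:
            coalition.append(item)
            cur = value_fn(coalition)
            credit[item] += cur - prev
            prev = cur
    return {item: c / math.factorial(n) for item, c in credit.items()}

def set_reward(coalition):
    # Hypothetical value function: "a" and "b" are interchangeable
    # candidates, so a set containing both earns no extra reward.
    has_ab = 1.0 if ("a" in coalition or "b" in coalition) else 0.0
    has_c = 2.0 if "c" in coalition else 0.0
    return has_ab + has_c

credit = shapley_values(["a", "b", "c"], set_reward)
# The redundant pair splits its credit (0.5 each) while "c" keeps its
# full contribution (2.0) -- unlike uniform per-candidate rewards.
```

Note the efficiency property: the credits sum exactly to the reward of the full set, which is what makes this a fair redistribution rather than a rescaling.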
Reward LLMs for verifiable reasoning steps, not just correct answers, to get more reliable multi-step logic.
Stop cobbling together memory-augmented agents: MemFactory offers a unified "Lego-like" framework that streamlines training and boosts performance by up to 14.8%.
Stochastic negative sampling in Direct Preference Optimization (DPO) dramatically improves multimodal sequential recommendation, suggesting that carefully curated "wrong" answers are key to preference learning.
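The DPO objective referenced in the blurb above is standard; the negative-sampling strategy only changes which rejected response is fed into it. A minimal sketch of the per-pair loss, with scalar sequence log-probabilities standing in for real model outputs:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss: -log(sigmoid(beta * margin)), where the margin
    is the policy's chosen-vs-rejected log-prob gap minus the reference
    model's gap on the same pair."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

A harder negative, one the current policy still assigns high probability, shrinks the margin and raises the loss, which is why the choice of "wrong" answers carries so much of the learning signal.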
Correcting errors early in the diffusion process matters more than fixing them later: Stepwise-Flow-GRPO leverages this insight to dramatically improve RL-based flow model training.
Unlock $\sqrt{N}$ regret in offline policy learning, even with complex policy classes, by trading off policy and environment complexity.
Stop handcuffing student diffusion models to their teachers: framing distribution matching as a reward unlocks more stable and performant distillation via RL techniques.
Forget hand-designed RL algorithms – LLMs can evolve competitive learners from scratch, even when forced to invent completely new update rules.
Stop assuming a single utility function: modeling preferences as a mixture of archetypes unlocks better Bayesian optimization in complex, many-objective spaces.
Surprisingly tight guarantees are possible for multi-agent reinforcement learning even when the human feedback is corrupted.
LLMs can reason more accurately and concisely when RL is guided by token-level entropy, pinpointing and exploring "forks in the road" during the reasoning process.
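The "forks in the road" in the entropy-guided RL blurb above are simply decoding steps where the next-token distribution is high-entropy. A minimal sketch, with hypothetical per-step distributions in place of real model logits:

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def find_forks(step_distributions, threshold=1.0):
    """Indices of decoding steps whose entropy exceeds a threshold --
    the high-uncertainty points an entropy-guided signal would target."""
    return [i for i, probs in enumerate(step_distributions)
            if token_entropy(probs) > threshold]

steps = [
    [0.97, 0.01, 0.01, 0.01],  # confident step: low entropy
    [0.25, 0.25, 0.25, 0.25],  # fork: maximal entropy, log(4) ~ 1.39
    [0.90, 0.05, 0.03, 0.02],  # confident again
]
forks = find_forks(steps)  # only the uniform step crosses threshold=1.0
```

The threshold here is arbitrary; in practice it would be tuned, or the entropy used directly as a continuous weight on the exploration bonus.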
Get 80% of your oracle feedback for free: ROVED leverages vision-language embeddings to drastically reduce the need for human preferences in reinforcement learning.
Forget hand-crafting prototypes for interpretable RL: this method learns them directly from the data, matching the performance of expert-designed systems.
LLMs can better adapt to diverse preferences by explicitly separating stable personal traits from situational factors, leading to significant performance gains, especially when preferences shift across episodes.
Reward hacking isn't a bug to fix, but an inevitable consequence of how we evaluate AI, and it gets exponentially worse as agents gain more tools.
Adversarial fine-tuning can now bypass Constitutional AI safety measures with almost no performance penalty, enabling models to provide detailed instructions on dangerous topics like CBRN warfare.
Robots can now learn complex manipulation tasks from scratch using only video and language, bypassing the need for hand-engineered reward functions, demonstrations, or even task-specific tuning.
Claude's Constitution doesn't create a neutral AI, but instead bakes in the values of Northern European and Anglophone cultures, creating a value floor that's hard to shift.
Even state-of-the-art LLMs like GPT-4o and Claude 3.5 still exhibit varying degrees of sycophancy depending on the input language, revealing persistent cultural and linguistic biases.
Over-refusal isn't just a misapplication of a global "no" switch; it's deeply intertwined with how LLMs represent and execute specific tasks.
Forget auxiliary encoders and handcrafted losses: LVRPO uses reinforcement learning to directly align language and vision, boosting performance across a range of multimodal tasks.
Unlock better hardware designs: RTLSeek's diversity-oriented RL lets LLMs explore a wider range of Verilog implementations, boosting both correctness and design options.
Agentic coding models can achieve near-SOTA performance by specializing in distinct coding domains before unifying them via on-policy distillation.
Training domain-specific coding LLMs with realistic environments and large-scale RL can yield substantial gains in practical software engineering tasks.
LMs can learn to generate multiple plausible answers in a single forward pass, outperforming traditional single-answer models on tasks requiring distributional reasoning and offering a compute-efficient alternative to best-of-k sampling.
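The best-of-k baseline that the last blurb compares against is easy to state: sample k candidates, score each, keep the best, at the cost of k forward passes where a multi-answer model spends one. A sketch with hypothetical candidates and a hypothetical scorer:

```python
def best_of_k(sample_fn, score_fn, k):
    """Best-of-k baseline: draw k candidates and keep the top scorer.
    Costs k model calls, vs. one pass for a model emitting an answer set."""
    candidates = [sample_fn() for _ in range(k)]
    return max(candidates, key=score_fn)

# Hypothetical answers and scorer; in practice sample_fn would call the
# model and score_fn would be a reward model or verifier.
answers = ["Paris", "Lyon", "Marseille"]
score = {"Paris": 0.9, "Lyon": 0.3, "Marseille": 0.2}
it = iter(answers)
pick = best_of_k(lambda: next(it), score.get, k=3)  # deterministic demo
```

The compute argument in the blurb is exactly this k-to-1 ratio: a single forward pass that yields a calibrated set of answers replaces the k sampled rollouts plus the external scorer.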