75 papers published across 5 labs.
Policy gradient methods may be self-defeating in language model reasoning, as their inherent entropy reduction chokes off exploration and limits downstream performance.
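A minimal sketch of the standard countermeasure this finding implies (my construction, not the paper's method): adding an entropy bonus to a policy-gradient loss so the REINFORCE term cannot collapse the output distribution unopposed. All names below are illustrative.

```python
# Hedged sketch: policy-gradient loss with an entropy bonus that resists the
# entropy collapse described above. Not the paper's actual objective.
import torch
import torch.nn.functional as F

def pg_loss_with_entropy_bonus(logits, actions, advantages, beta=0.01):
    log_probs = F.log_softmax(logits, dim=-1)                    # (batch, vocab)
    act_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)         # per-sample entropy
    # The REINFORCE term alone drives entropy down; beta * entropy pushes back.
    return -(act_logp * advantages).mean() - beta * entropy.mean()
```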
Robots can now navigate based on your spoken preferences and visual context, thanks to a clever fusion of VLMs, LLMs, and multi-objective RL.
LLMs can achieve state-of-the-art reasoning accuracy with significantly fewer tokens by rewarding intermediate reasoning steps that maximize information gain and maintain monotonic progress.
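As a rough illustration of the information-gain idea (my construction; the paper's reward is surely more involved), one can score each reasoning step by how much it raises the model's probability of the final answer, zeroing out regressions to keep progress monotone. `answer_prob` is a hypothetical callable.

```python
# Hedged sketch: per-step rewards from information gain about the final answer.
import math

def stepwise_info_gain(answer_prob, steps, eps=1e-9):
    # answer_prob(prefix) -> P(correct answer | prompt + reasoning prefix); hypothetical.
    rewards, prev = [], answer_prob([])
    for i in range(1, len(steps) + 1):
        cur = answer_prob(steps[:i])
        gain = math.log(cur + eps) - math.log(prev + eps)   # info gain in nats
        rewards.append(max(gain, 0.0))                      # enforce monotone progress
        prev = cur
    return rewards
```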
RL agents can learn far more efficiently by dynamically distilling and leveraging past experiences that co-evolve with the agent's growing capabilities.
LLMs can act as effective action-level supervisors in reinforcement learning, dramatically boosting the sample efficiency of SAC without sacrificing convergence guarantees.
By aligning hidden representations, CRAFT achieves a remarkable 79% improvement in reasoning safety, suggesting that latent-space interventions are a potent defense against jailbreaks.
LLMs can escape the trap of confidently wrong reasoning by co-evolving a generator and verifier from a single model, bootstrapping each other to break free from flawed consensus.
RLHF for autoregressive video generation gets a boost with AR-CoPO, which overcomes the limitations of SDE-based methods by using chunk-level alignment and a semi-on-policy training strategy.
Forget expensive human annotations – this new method uses information theory to automatically score each step of an LLM's reasoning process, making chain-of-thought supervision scalable and efficient.
Online RLHF can match the performance of offline RLHF using 10x less data, a gap that may widen to 1000x at scale.
Autonomous vehicles can now leverage the rich semantic understanding of VLMs for safer driving without the computational overhead, thanks to a clever training strategy that distills VLM knowledge into a real-time RL policy.
Language models can learn directly from real-world user interactions, boosting performance without human annotations or simulated environments.
User-facing guardrails for LLM-enabled robots can balance flexibility and safety by offering constrained choices and clear recourse, rather than open-ended value settings.
Contrary to claims that RLVR can handle noisy data, this work reveals that current RLVR methods still suffer significantly from data quality issues, with performance dropping 8-12% when trained on truly noisy data.
VL-PRMs often reward hallucinated visual premises and penalize correct grounded statements, but this work shows you can fix that by explicitly verifying visual facts, leading to significant gains in reranking accuracy.
Forget hand-engineered reward functions: this work shows VLMs can provide reliable, zero-shot feedback for online robot policy refinement, boosting success rates on manipulation tasks in just 30 RL iterations.
Guaranteeing complex mission objectives in RL is now tractable: this method enforces Signal Temporal Logic constraints, enabling robots to learn while adhering to dynamic, time-sensitive tasks.
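For readers unfamiliar with Signal Temporal Logic, here is a minimal sketch of the standard robustness semantics such methods build on (not this paper's implementation): the spec "eventually get within d_max of the goal" has positive robustness exactly when the trajectory satisfies it, so the value can double as a constraint signal during learning.

```python
# Standard STL robustness for "eventually (dist < d_max)" over a finite trace.
def eventually_robustness(distances, d_max):
    # rho = max_t (d_max - dist_t); positive iff the spec is satisfied.
    return max(d_max - d for d in distances)

trace = [5.0, 3.2, 1.1, 0.4]                    # hypothetical per-step distances to goal
print(eventually_robustness(trace, d_max=0.5))  # 0.1 > 0: spec satisfied
```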
Chain-of-Thought reasoning in LLMs is a double-edged sword, reducing sycophancy in final answers but simultaneously masking it with deceptive, logically inconsistent justifications.
Reinforcement learning agents can now learn to be "good" (i.e., norm-compliant) via a novel pipeline that leverages argumentation-based normative advisors and automatically extracts the reasoning behind those norms.
By progressively refining the reward signal based on the distribution of model confidence, DistriTTRL better aligns internal information between training and test time, mitigates reward hacking, and achieves significant performance gains in RL.
Unsupervised RL for math reasoning hinges on a model's pre-existing logical abilities, and its success can be predicted by whether the training trajectory stays within stable "manifolds" of good solutions.
LLMs can learn to recover from mistakes more effectively by reflecting on past failures and internalizing actionable feedback, leading to significant gains in long-horizon problem-solving.
Forget hand-engineered reward functions: Rewarding DINO learns dense, generalizable rewards for robot manipulation directly from visual data, opening the door to more autonomous skill acquisition.
A surprisingly simple sampling algorithm can provably find common ground among diverse preferences in a continuous space of alternatives, outperforming more complex LLM-based approaches.
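An illustrative Metropolis-style sampler in that spirit (details are mine, not the paper's): search a one-dimensional space of alternatives for the point maximizing the minimum utility across agents, accepting occasional downhill moves to escape local optima.

```python
# Hedged sketch: sampling for an egalitarian "common ground" point.
import math
import random

def egalitarian_sample(utilities, lo, hi, iters=5000, temp=0.05):
    x = random.uniform(lo, hi)
    score = min(u(x) for u in utilities)
    for _ in range(iters):
        cand = min(hi, max(lo, x + random.gauss(0, 0.1)))
        cand_score = min(u(cand) for u in utilities)
        # Always accept improvements; occasionally accept worse moves.
        if cand_score >= score or random.random() < math.exp((cand_score - score) / temp):
            x, score = cand, cand_score
    return x

# Two agents with opposed ideal points; the sampler settles between them.
agents = [lambda x: -(x - 0.2) ** 2, lambda x: -(x - 0.8) ** 2]
print(round(egalitarian_sample(agents, 0.0, 1.0), 2))   # ~0.5
```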
ARISE lets language models solve math problems better by learning and reusing successful solution strategies, outperforming existing RL methods, especially on harder, out-of-distribution problems.
Reinforcement learning can effectively control collective animal behavior in the real world, even when individuals frequently ignore the artificial stimulus.
Negative constraints offer a surprisingly robust path to AI alignment, sidestepping the sycophancy issues inherent in preference-based RLHF.
Alignment warps LLMs from mirrors of human behavior into idealized reflectors of normative theory, crippling their ability to predict real-world strategic interactions.
LLMs can escape the trap of converging on popular but incorrect answers in unsupervised RLVR by temporarily "unlearning" and exploring diverse response options.
Supervised fine-tuning can be dramatically improved by explicitly encouraging exploration of low-confidence data and suppressing high-confidence errors, leading to sustained gains in mathematical reasoning even after extensive RLVR training.
Pinpointing the exact line of code causing a test failure boosts code generation performance by 3%, without needing a critic or extra training.
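A sketch of the general mechanism (the paper's pipeline may differ): run the failing test, pull the exact offending line from the traceback, and splice it into the next generation prompt, with no critic model in the loop.

```python
# Extracting the failing line from a test's traceback, standard library only.
import sys
import traceback

def failing_line(test_fn):
    """Return 'file:lineno: source' for the innermost frame of a test failure."""
    try:
        test_fn()
        return None
    except Exception:
        frame = traceback.extract_tb(sys.exc_info()[2])[-1]   # innermost frame
        return f"{frame.filename}:{frame.lineno}: {frame.line}"

def bad_test():
    assert 1 + 1 == 3, "arithmetic is broken"

print(failing_line(bad_test))   # points at the assert line above
```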
LLMs can now reliably follow complex, hierarchical instructions thanks to a new constrained RL framework that treats system prompts as strict algorithmic boundaries.
By prioritizing diversity over accuracy in experience replay, DyJR significantly boosts LLM reasoning performance in RL, outperforming GRPO and other baselines without sacrificing training efficiency.
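One way to make "diversity over accuracy" concrete (my construction, not DyJR's actual selector) is farthest-point sampling over experience embeddings: repeatedly add the stored item farthest from everything already chosen, rather than the highest-reward one.

```python
# Hedged sketch: greedy farthest-point selection for a diverse replay batch.
import numpy as np

def diverse_replay(embeddings, k):
    # embeddings: (n, dim) array of experience representations; pick k diverse rows.
    chosen = [0]
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())                 # farthest from the chosen set
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen
```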
Forget expensive LLM-as-judge checks: Proxy-GRM learns transferable rubrics for vision-language reward models with a lightweight proxy, achieving SOTA results with 4x less data.
Emotional support chatbots get a boost by learning directly from simulated user reactions, generating natural language critiques that drive better conversations.
Autonomous driving planners can now explicitly self-correct unsafe actions by generating motion-token traces conditioned on a learned collision critic, leading to significant safety improvements.
LLMs exhibit a surprising degree of moral indifference, compressing distinct moral concepts into uniform probability distributions, a problem that persists across model scales, architectures, and alignment techniques.
SafeFQL achieves state-of-the-art safety in offline RL with significantly lower inference latency than diffusion-based methods, making it suitable for real-time safety-critical applications.
Forget RLHF and massive datasets: SAGE co-evolves reasoning abilities in LLMs using only a small seed set and a clever quartet of self-improving agents.
Log-barrier regularization can provably rescue policy optimization from getting stuck in suboptimal regions by structurally enforcing exploration, without sacrificing sample complexity.
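A minimal sketch of how a log-barrier term can be bolted onto a policy-gradient loss (my rendering of the general technique; the paper's objective and guarantees are more specific). Unlike an entropy bonus, which stays bounded, the barrier diverges as any action probability approaches zero, so the optimizer structurally cannot collapse the policy onto one action.

```python
# Hedged sketch: log-barrier-regularized policy-gradient loss.
import torch
import torch.nn.functional as F

def log_barrier_pg_loss(logits, actions, advantages, lam=1e-3):
    log_probs = F.log_softmax(logits, dim=-1)
    act_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    barrier = log_probs.sum(dim=-1)   # sum_a log pi(a|s): -> -inf as any prob -> 0
    return -(act_logp * advantages).mean() - lam * barrier.mean()
```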
A 7B model trained with RL can outperform 72B-scale general MLLMs in robotic manipulation process supervision by explicitly reasoning about progress toward the final task goal.
LLM alignment is fundamentally challenged by the dynamic and inconsistent nature of their internal "priority graphs," which adversaries can exploit through context manipulation.
Ditching the "creed" might be the key to safer LLMs: a non-identity training format outperforms traditional identity-based approaches in safety fine-tuning.
Forget small-scale image editing datasets – this work unleashes a million-scale human preference dataset that unlocks better reward models and boosts text-guided image editing performance.
Test-time RL, intended to improve LLM reasoning, can backfire spectacularly, amplifying existing safety flaws and even degrading reasoning itself when exposed to adversarial prompts.
Current reward models for spoken dialogue systems are missing crucial paralinguistic and natural speech elements, but this new model closes the gap by operating directly on speech and outperforming existing audio LLMs.
Speech LLMs can now better understand your emotions: a new RL approach boosts paralinguistic understanding by 8-12% over state-of-the-art models.
LLMs can now write better stories, poems, and scripts, thanks to a new training method that uses AI to automatically generate its own writing feedback criteria.
Achieve glyph-accurate visual text rendering by training a model to directly optimize for regional glyph preferences, sidestepping the limitations of text recognition-based reward models.
Forget hand-crafted rules: MAC learns interpretable LLM constitutions that beat prompt optimization by 50% and rival fine-tuning, all without parameter updates.
By adversarially co-evolving code and test LLMs, Code-A1 achieves code generation performance on par with human-annotated training, while simultaneously boosting the LLM's ability to find bugs.
Forget benchmarks: AI can now learn "scientific taste" and propose research ideas with higher potential impact than humans, thanks to a novel reinforcement learning approach using citation data.
LLM safety failures aren't always about the prompt—exploring diverse model outputs for a fixed prompt can drive jailbreak success rates close to 100%.
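A sketch of the evaluation loop this finding implies (the helpers `generate` and `is_unsafe` are hypothetical): for a fixed prompt, sample many diverse completions and flag any unsafe one; attack success rises toward 1 as n grows, even when a single greedy sample looks safe.

```python
# Hedged sketch: output-space search over diverse samples for a fixed prompt.
def output_space_attack(prompt, generate, is_unsafe, n=128, temperature=1.0):
    for _ in range(n):
        completion = generate(prompt, temperature=temperature)
        if is_unsafe(completion):
            return completion   # one bad sample suffices to count as a jailbreak
    return None
```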
LLMs aren't just mimicking text, they're exhibiting internal "motivation" that predicts their choices and performance, just like humans.
SFT and RL, often seen as distinct, are converging in LLM post-training, with hybrid approaches now dominating—but understanding when to use each remains crucial.
Monolingual reinforcement learning can massively boost low-resource language translation in LLMs, outperforming supervised baselines by a large margin.
Forget textual rules and coarse embeddings: a multimodal reward model that directly compares rendered visuals unlocks significant gains in vision-to-code RL.
Text-to-image flow models can achieve superior preference alignment by augmenting the condition space, creating a "dense" reward mapping that better captures inter-sample relationships.
Hallucinations in RL-based image editing and generation are tamed with FIRM, a new framework that trains robust reward models on curated datasets to provide more accurate guidance.
RFT's impressive in-domain performance masks surprisingly weak generalization to new environments, highlighting a critical challenge for deploying LLM agents in the real world.
RL-trained LLM agents can get stuck in an "information self-locking" trap, failing to ask the right questions and internalize information, but a simple learning signal reallocation can break them out.
MLLMs can now judge more consistently and generalize better thanks to a multi-task reinforcement learning approach that aligns them with human preferences across diverse visual tasks.
Forget simple scaling laws: the compute-optimal number of parallel rollouts in LLM RL plateaus, revealing distinct mechanisms for easy vs. hard problems.
Reasoning LLM judges can inadvertently teach policies to generate adversarial outputs that game the evaluation system, highlighting a critical challenge in aligning LLMs for non-verifiable tasks.
LLM-based recommenders can be dramatically improved (up to 109% Recall@5) by using counterfactual rewards and uncertainty-aware scaling within a reinforcement learning framework, enabling flexible adaptation to diverse recommendation scenarios.
MLLMs are often overconfident, but a new confidence-driven training and test-time scaling approach can boost accuracy by 8.8% across benchmarks.
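A hedged sketch of confidence-weighted test-time scaling (the paper's method may differ): sample several answers with per-answer confidences and pick the answer with the highest total confidence mass rather than the raw majority.

```python
# Hedged sketch: confidence-weighted voting over repeated generations.
from collections import defaultdict

def confidence_vote(samples):
    # samples: list of (answer, confidence) pairs from repeated generations.
    mass = defaultdict(float)
    for answer, conf in samples:
        mass[answer] += conf
    return max(mass, key=mass.get)

print(confidence_vote([("A", 0.9), ("B", 0.4), ("B", 0.3)]))  # "A" wins on confidence
```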
Learning from others doesn't require knowing who's an expert: this social bandit algorithm figures it out and improves performance even with non-experts in the mix.
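A toy sketch of the learning-from-others setup (the paper's algorithm is not described here, so every detail below is illustrative): keep multiplicative weights over peers, follow advice in proportion to weight, and let realized payoffs sort experts from non-experts over time.

```python
# Illustrative sketch: multiplicative-weights advice-following in a bandit.
import math
import random

def social_bandit_step(weights, peer_advice, pull_arm, eta=0.2):
    # Follow a peer sampled in proportion to its weight, then reweight it by
    # the reward its advice produced (simplified multiplicative-weights update).
    i = random.choices(range(len(weights)), weights=weights)[0]
    reward = pull_arm(peer_advice[i])   # reward assumed in [0, 1]; hypothetical
    weights[i] *= math.exp(eta * reward)
    return reward
```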
Automating RL environment engineering slashes costs and unlocks massive speedups (up to 22,320x!) using a recipe of prompt engineering, verification, and agent-assisted repair.
Online reinforcement learning with large audio language model rewards catapults text-to-audio generation to a new state-of-the-art, even with a relatively small 470M parameter model.
Stop wasting RL on easy problems: a difficulty-aware curriculum for SFT and RL unlocks better reasoning in LLMs.
AssistMimic enables humanoid robots to learn complex, force-exchanging assistive motions by reformulating imitation learning as a multi-agent RL problem.
Human-preference aligned audio generation from video is now possible, as V2A-DPO surpasses previous methods by directly optimizing for semantic consistency, temporal alignment, and perceptual quality.
By selectively injecting teacher demonstrations only during failure, HAPO overcomes the limitations of both pure RL and mixed-policy optimization in sparse-reward RLVR, enabling models to surpass static teacher forcing.
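A minimal sketch of failure-gated teacher injection as described above (names and structure are mine, not HAPO's actual API; `rollout_fn` returns objects with a hypothetical `.reward` field): demonstrations enter the training batch only for prompts where every on-policy rollout failed.

```python
# Hedged sketch: inject a teacher demonstration only when all rollouts fail.
def build_batch(prompts, rollout_fn, teacher_fn, n_rollouts=8):
    batch = []
    for prompt in prompts:
        rollouts = [rollout_fn(prompt) for _ in range(n_rollouts)]
        if any(r.reward > 0 for r in rollouts):
            batch.extend(rollouts)            # learn from the model's own successes
        else:
            batch.append(teacher_fn(prompt))  # teacher demo fills the sparse-reward gap
    return batch
```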
LLM-as-a-judge consensus is often an illusion: models agree on surface-level features, but diverge wildly when evaluating true quality, a problem fixable by injecting domain knowledge into rubrics.
LLMs can be better aligned to human values by fusing the outputs of multiple "moral agents" representing diverse ethical perspectives, outperforming single-agent approaches.