42 papers published across 6 labs.
Aligning diffusion models with just 100 carefully selected samples can beat state-of-the-art preference optimization methods trained on thousands, and converge up to 220x faster.
Training domain-specific coding LLMs with realistic environments and large-scale RL can yield substantial gains in practical software engineering tasks.
LMs can learn to generate multiple plausible answers in a single forward pass, outperforming traditional single-answer models on tasks requiring distributional reasoning and offering a compute-efficient alternative to best-of-k sampling.
LLMs can reason through chains of thought 2.5x longer and achieve 8% higher accuracy on complex math problems by optimizing for token-level influence on future trajectory behavior.
Injecting demonstrations with a carefully annealed probability can drastically improve exploration in RLVR, even for tasks requiring novel reasoning or domain-specific knowledge.
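A minimal sketch of the annealing idea as I read it (the linear schedule, batch format, and helper names below are illustrative assumptions, not taken from the paper): with a probability that decays over training, a rollout in the RL batch is swapped for a known-good demonstration, so early exploration is guided toward verifier-passing trajectories.

```python
import random

def demo_injection_prob(step: int, total_steps: int,
                        p_start: float = 0.5, p_end: float = 0.0) -> float:
    """Linearly anneal the probability of injecting a demonstration."""
    frac = min(step / max(total_steps, 1), 1.0)
    return p_start + (p_end - p_start) * frac

def build_rlvr_batch(prompts, policy_sample, demo_lookup, step, total_steps):
    """Mix on-policy rollouts with demonstrations at an annealed rate.

    policy_sample(prompt) -> completion sampled from the current policy
    demo_lookup(prompt)   -> a known-good (verifier-passing) completion, or None
    """
    p = demo_injection_prob(step, total_steps)
    batch = []
    for prompt in prompts:
        demo = demo_lookup(prompt)
        if demo is not None and random.random() < p:
            batch.append((prompt, demo, True))                    # injected demonstration
        else:
            batch.append((prompt, policy_sample(prompt), False))  # on-policy rollout
    return batch
```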
Observational user feedback, often dismissed as too noisy and biased, can actually power effective RLHF with the right causal modeling, achieving a 49.2% gain on WildGuardMix.
Scale up offline policy training for diffusion LLMs without breaking the bank: dTRPO slashes trajectory computation costs while boosting performance by up to 9.6% on STEM tasks.
Forget prompt engineering – LSE trains LLMs to self-edit their own contexts at test time, outperforming even GPT-5 and Claude Sonnet 4.5 in Text-to-SQL and question answering.
Forget random data mixing: MOSAIC uses failure analysis to intelligently curate training data, leading to better safety, less over-refusal, and improved instruction following, all at once.
Unleashing an LLM's inner creativity or laser-sharp logic is now as simple as turning a knob, thanks to a new distribution-matching method that avoids heuristic rewards.
LLMs surprisingly prioritize norm adherence over personal incentives in business scenarios, challenging assumptions about goal-driven behavior.
Training multi-turn LLM agents just got easier: ProRL Agent offers a scalable, API-driven rollout service that streamlines RL training across diverse tasks.
LLM post-training pipelines can be configured with 10x less compute using AutoPipe, a budget-aware framework that learns from historical runs and predicts performance from early training signals.
Human oversight can be systematically integrated into LLM-based text generation to improve accessibility, creating a traceable and auditable process.
Achieve significant reasoning gains in frozen LLMs (+22.4%) without retraining by adaptively routing reward model guidance at the token level during inference.
Forget fixed decoding strategies – RL can learn a lightweight policy to adapt LLM sampling *at test time*, boosting summarization quality by up to 88% without retraining the LLM.
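One way to picture such a test-time decoding controller (the feature set, candidate grid, and class below are assumptions for illustration, not the paper's interface): a tiny learned policy maps cheap per-input features to sampling parameters such as temperature and top-p, trained with a bandit-style REINFORCE update against any scalar quality score, while the LLM itself stays frozen.

```python
import numpy as np

class DecodingPolicy:
    """Tiny linear policy that picks (temperature, top_p) per input."""

    def __init__(self, n_features: int,
                 candidates=((0.3, 0.9), (0.7, 0.95), (1.0, 1.0)), lr: float = 0.1):
        self.candidates = candidates                    # discrete sampling settings
        self.W = np.zeros((len(candidates), n_features))
        self.lr = lr

    def _probs(self, features: np.ndarray) -> np.ndarray:
        logits = self.W @ features
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()

    def act(self, features: np.ndarray):
        """Sample a candidate setting for this input."""
        probs = self._probs(features)
        idx = np.random.choice(len(self.candidates), p=probs)
        return idx, self.candidates[idx]

    def update(self, features: np.ndarray, idx: int, reward: float, baseline: float = 0.0):
        """REINFORCE: push up the log-prob of settings that scored well."""
        probs = self._probs(features)
        grad = -probs[:, None] * features[None, :]      # d log-softmax / dW for all rows
        grad[idx] += features
        self.W += self.lr * (reward - baseline) * grad
```

At inference the controller just calls `act` on the input's features and hands the chosen temperature/top-p to the frozen LLM's sampler.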
Learning from ranked preferences alone can be surprisingly difficult: even with access to the full ranking of actions, standard online learning guarantees break down unless the environment is sufficiently stable.
Forget static models: this adaptive framework slashes stock price prediction error by dynamically routing data through specialized pathways based on real-time market regime detection.
Stripping away the complexity of GRPO reveals that simple REINFORCE with group relative advantage can actually *improve* LLM reasoning, challenging the assumption that sophisticated loss functions are always better.
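The group-relative advantage itself is simple; here is a minimal sketch (the tensor layout and the omission of clipping or KL terms are my simplifications): sample several completions per prompt, subtract the group's mean reward (optionally dividing by its std), and weight each completion's summed log-probability by that advantage.

```python
import torch

def group_relative_reinforce_loss(logprobs: torch.Tensor,
                                  rewards: torch.Tensor,
                                  mask: torch.Tensor,
                                  normalize_std: bool = True) -> torch.Tensor:
    """Plain REINFORCE with a group-relative baseline.

    logprobs: [G, T] per-token log-probs of G sampled completions for one prompt
    rewards:  [G]    scalar reward per completion (e.g. verifier 0/1)
    mask:     [G, T] 1 for generated tokens, 0 for padding
    """
    advantage = rewards - rewards.mean()
    if normalize_std:
        advantage = advantage / (rewards.std() + 1e-6)
    seq_logprob = (logprobs * mask).sum(dim=-1)               # [G]
    # Maximize advantage-weighted log-likelihood -> minimize its negative.
    return -(advantage.detach() * seq_logprob).mean()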
Human-AI teams often fail not because AI is inaccurate, but because humans miscalibrate their reliance on it, highlighting the need for readiness metrics beyond accuracy.
Greedy off-policy learning, optimal in theory, can fail spectacularly when supplies are limited, but a simple fix—prioritizing items with high *relative* reward—can restore performance.
Low-resource language models can get a major boost in translation quality and tokenization efficiency by using reinforcement learning to directly enforce structural constraints like sequence length and linguistic well-formedness during training.
Achieve topologically coherent coronary vessel segmentation by directly optimizing for geometric structure, rather than pixel-wise accuracy, using preference-based learning.
LRMs can be made more efficient and accurate by strategically adjusting their output length based on task difficulty, leading to a better accuracy-length trade-off.
Forget struggling with cryptic SQL: a new LLM fine-tuned with human preferences generates comments so good that they beat Qwen3-14B by up to 13% on standard metrics.
Decomposing GUI agent trajectories into verifiable milestones and auditing the evidence chain yields a 10% boost in RL training performance, outperforming single-judge reward systems.
Forget hand-crafting agents: Memento-Skills lets a generalist LLM agent autonomously design and improve specialized agents through experience, achieving substantial gains on complex benchmarks.
On-policy reward modeling with LLM judges not only unlocks significant performance gains on complex mathematical reasoning tasks, but also generalizes to improve performance on simpler numerical and multiple-choice benchmarks.
A 30B MoE model can now achieve Gold Medal-level performance in IMO, IOI, and ICPC, rivaling frontier models with 20x more parameters.
Imagine a single algorithm that dominates in both predictable and chaotic ranking scenarios – this paper delivers it for multi-dueling bandits.
Skip the expensive reward model: RewardFlow distills sparse task rewards into dense, state-level signals by propagating credit through the topology of LLM reasoning trajectories.
Aligning rewards with sub-goals and emphasizing key trajectory segments with hindsight information significantly improves multi-turn agentic RL, outperforming existing methods on complex tasks.
Robots can now navigate based on your spoken preferences and visual context, thanks to a clever fusion of VLMs, LLMs, and multi-objective RL.
LLMs can achieve state-of-the-art reasoning accuracy with significantly fewer tokens by rewarding intermediate reasoning steps that maximize information gain and maintain monotonic progress.
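One reading of that step reward (the probability estimator and the monotonicity handling below are illustrative assumptions, not the paper's definitions): score each reasoning step by how much it raises an estimate of the probability that the final answer will be correct, and give nothing for steps that move that estimate backwards.

```python
from typing import Callable, List

def stepwise_progress_rewards(prefixes: List[str],
                              p_correct: Callable[[str], float]) -> List[float]:
    """Reward each step by its (non-negative) gain in estimated success probability.

    prefixes[t] is the reasoning trace truncated after step t;
    p_correct(prefix) is any estimator of P(final answer correct | prefix),
    e.g. a value head or an LLM judge mapped to [0, 1].
    """
    rewards = []
    prev = p_correct("")                       # prior before any reasoning
    for prefix in prefixes:
        cur = p_correct(prefix)
        rewards.append(max(cur - prev, 0.0))   # only monotone progress is rewarded
        prev = max(prev, cur)                  # keep the baseline non-decreasing
    return rewards
```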
RL agents can learn far more efficiently by dynamically distilling and leveraging past experiences that co-evolve with the agent's growing capabilities.
LLMs can act as effective action-level supervisors in reinforcement learning, dramatically boosting the sample efficiency of SAC without sacrificing convergence guarantees.
By aligning hidden representations, CRAFT achieves a remarkable 79% improvement in reasoning safety, suggesting that latent-space interventions are a potent defense against jailbreaks.
LLMs can escape the trap of confidently wrong reasoning by co-evolving a generator and verifier from a single model, bootstrapping each other to break free from flawed consensus.
RLHF for autoregressive video generation gets a boost with AR-CoPO, which overcomes the limitations of SDE-based methods by using chunk-level alignment and a semi-on-policy training strategy.
Forget expensive human annotations – this new method uses information theory to automatically score each step of an LLM's reasoning process, making chain-of-thought supervision scalable and efficient.
Online RLHF can match the performance of offline RLHF with 10x less data, and potentially 1000x less at scale.
Autonomous vehicles can now leverage the rich semantic understanding of VLMs for safer driving without the computational overhead, thanks to a clever training strategy that distills VLM knowledge into a real-time RL policy.