Search papers, labs, and topics across Lattice.
66 papers published across 9 labs.
Stop wasting RL on easy problems: a difficulty-aware curriculum for SFT and RL unlocks better reasoning in LLMs.
AssistMimic enables humanoid robots to learn complex, force-exchanging assistive motions by reformulating imitation learning as a multi-agent RL problem.
Human-preference aligned audio generation from video is now possible, as V2A-DPO surpasses previous methods by directly optimizing for semantic consistency, temporal alignment, and perceptual quality.
By selectively injecting teacher demonstrations only during failure, HAPO overcomes the limitations of both pure RL and mixed-policy optimization in sparse-reward RLVR, enabling models to surpass static teacher forcing.
LLM-as-a-judge consensus is often an illusion: models agree on surface-level features, but diverge wildly when evaluating true quality, a problem fixable by injecting domain knowledge into rubrics.
LLMs can be better aligned to human values by fusing the outputs of multiple "moral agents" representing diverse ethical perspectives, outperforming single-agent approaches.
Stop training LLMs on lucky guesses: this new RL method uses the model's own in-context learning ability to identify and upweight high-quality reasoning traces, leading to better performance.
Get 6x the RLHF alignment for your LLM with a new active learning pipeline that focuses on annotating the most informative response pairs.
Recommendation welfare can provably exceed what any learner-measurable treatment policy achieves when downstream actors hold private information, forcing a critical re-evaluation of learning objectives in bandit settings with noncompliance.
A new video-based reward model beats GPT-5.2 and Gemini-3 Pro at evaluating computer-using agents, offering a scalable, model-agnostic alternative to traditional methods.
Escape the CVAE decoder bottleneck: SPAARS unlocks better offline-to-online RL by safely exploring a latent space, then seamlessly switching to raw actions.
Forget expensive human annotations: RubiCap uses LLM-generated rubrics to train image captioning models via RL, achieving superhuman performance and even improving VLM pretraining.
LLMs can be steered away from hallucination and toward more robust reasoning by using contrastive learning to capture the shared structure of successful reasoning paths, separating sound chains from flawed ones even when the final answer happens to be correct.
Forget hand-engineered reward functions: Reward-Zero uses language embeddings to give RL agents an intrinsic "sense of completion," dramatically improving sample efficiency and generalization.
Forget finetuning on curated datasets – OpenClaw-RL lets agents learn directly and continuously from *every* interaction, turning user replies, tool outputs, and even GUI changes into valuable RL signals.
Unlock multimodal interleaved generation in existing vision-language models without large interleaved datasets using a novel reinforcement learning approach with hybrid rewards.
LLMs trained with reinforcement learning from verifiable rewards (RLVR) become overconfident in incorrect answers, but a simple fix—decoupling reasoning and calibration objectives—can restore proper calibration without sacrificing accuracy.
LLM-based judges, widely used for automated evaluation, are riddled with diverse biases that can be significantly reduced through bias-aware training using RL and contrastive learning.
Humans are surprisingly forgiving of robot mistakes: slips and freezes are penalized more harshly than other errors, and some mistakes are even interpreted as successes.
Users prefer robots that learn their preferences using CMA-ES-IG because it suggests more perceptually distinct and informative behaviors to rank.
LLMs can switch between reasoning and factual answering on the fly, without retraining, simply by conditioning on specific token prefixes.
LLM agents can learn to continuously adapt and improve in complex environments by reflecting on past experiences and explicitly storing/retrieving reusable lessons, leading to substantial performance gains.
Mitigate the brittleness of RLHF by explicitly controlling for disagreement and tail risk during inference, without retraining, using a KL-robust optimization framework.
Forget noisy, biased LLM evaluators: CDRRM distills preference insights into compact rubrics, letting a frozen judge model leapfrog fully fine-tuned baselines with just 3k training samples.
Humanoid robots can now maintain balance under complex external forces without force/torque sensors, thanks to a force-adaptive RL policy that learns to anticipate and compensate for disturbances.
Intrinsic reward signals in unsupervised RL for LLMs inevitably collapse due to sharpening of the model's prior, but external rewards grounded in computational asymmetries offer a path to sustained scaling.
Instead of imitating reflections, LLM agents can be trained to reason about action quality by rewarding correct judgments between alternative actions, leading to improved performance and generalization.
Human and AI feedback in RLHF are surprisingly susceptible to "choice blindness," where manipulated preferences often go unnoticed, undermining the reliability of alignment signals.
Concave multi-objective RL suffers from a previously unaddressed gradient bias that doubles the sample complexity, but this can be fixed with multi-level Monte Carlo or, surprisingly, vanishes entirely with smooth scalarization functions.
By "imagining" new scenarios and asking "What if this were the true preference?", CRED actively designs environments and trajectories to expose differences between competing reward functions, dramatically improving preference learning.
Achieve better token efficiency in LLM policy optimization by using a novel FiberPO objective whose Jacobian is block-diagonal over trajectories and reduces to identity on-policy.
Jumpstart your research agent: synthetic tool-use plans overcome exploration bottlenecks and boost performance by up to 6% on multi-hop reasoning tasks.
Strategic data curation using a dual-consensus approach beats brute-force training on large noisy datasets for process reward modeling in biological reasoning.
Forget massive datasets – targeted training on a smaller, carefully curated dataset of challenging competitive programming problems yields 3x faster gains in code generation performance.
By rethinking RLHF, MicroCoder-GRPO enables smaller code generation models to rival larger counterparts, achieving significant performance gains and revealing 34 training insights.
Finally, you can use human feedback and other real-world, non-differentiable rewards to fine-tune fast, few-step diffusion models.
LLM agents can learn to solve complex, long-horizon tasks much more effectively by using themselves as post-hoc critics to refine Q-values through hindsight reasoning.
Forget external debuggers: ReflexiCoder teaches LLMs to self-reflect and self-correct code, rivaling GPT-5.1 in performance while slashing inference costs by 40%.
Recursive self-improvement can boost performance by 18% in code and 17% in reasoning, but only if you can keep it from going off the rails – SAHOO provides the guardrails.
StableDRL tames the wild instability of applying reinforcement learning to diffusion language models, enabling more reliable post-training optimization.
Train one RL agent to handle a whole family of reward functions, unlocking robust and adaptable policies without the complexity of multi-task training.
Weak LLMs, when strategically leveraged via confidence-based sample weighting, can not only drastically cut preference alignment costs but also surpass the performance of models trained on full human-labeled datasets.
RLHF's reliance on gradient-based alignment inherently limits its depth, causing it to focus on early tokens and neglect later, potentially harmful, contextual dependencies.
Grounding reward learning in natural language rationales makes policies 2x more robust to spurious correlations and distribution shifts.
Ditch the planner-tracker hierarchy: RL can directly control spherical robots for efficient point-to-point navigation, even transferring from sim-to-real with high stability.
A quadrupedal robot learns to climb steep slopes by "feeling" its own instability, using a learned Tumble Stability Margin to proactively avoid falls.
Reinforcement learning with audio-text semantic rewards can overcome confirmation bias in test-time adaptation, leading to more robust ASR in noisy and accented speech environments.
Debate between AI models hits a phase transition: it's useless when they know the same things, but becomes essential as their knowledge diverges.
Forget scaling laws: reinforcement fine-tuning with verifiable rewards lets a 4B parameter model beat an 8B parameter model on challenging 3D scene understanding tasks.
PPO's fixed clipping hurts exploration by squashing high-reward, low-probability actions, but BandPO fixes this with probability-aware bounds that boost performance.
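For readers unfamiliar with the mechanism this teaser alludes to, here is a minimal NumPy sketch of PPO's fixed clipped surrogate alongside a hypothetical probability-aware band; the widening rule shown is only an illustration of the idea, not BandPO's actual formula.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    # Standard PPO surrogate: the fixed clip range [1-eps, 1+eps] caps how far
    # the new policy can move, which also squashes high-advantage actions that
    # the old policy assigned low probability.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.minimum(ratio * advantage, clipped).mean()

def probability_aware_clip_loss(ratio, advantage, old_prob, eps=0.2):
    # Hypothetical probability-aware band (not BandPO's actual formula):
    # widen the clip range for actions the old policy considered unlikely,
    # so rare high-reward actions are not squashed as aggressively.
    width = np.minimum(eps / np.sqrt(old_prob + 1e-8), 1.0)
    clipped = np.clip(ratio, 1.0 - width, 1.0 + width) * advantage
    return -np.minimum(ratio * advantage, clipped).mean()
```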
Current judge models for instruction-following are surprisingly unreliable, but a new benchmark exposes their flaws and offers a path to better alignment.
LLMs get stuck in their ways: even explicit corrections can't break their rigid adherence to initial (incorrect) reasoning paths in multi-turn interactions, but a novel RL approach can fix it.
By explicitly modeling the latent human evaluation process, VRM offers a more robust reward model, sidestepping the pitfalls of spurious correlations that plague traditional methods.
A 4B parameter SLM can now rival frontier agent performance in complex tool-use environments, thanks to a novel reinforcement finetuning framework that teaches it how to strategically acquire context and execute actions.
By disentangling state representation from policy optimization, DSRM-HRL breaks the accuracy-fairness tradeoff in recommender systems, achieving state-of-the-art fairness without sacrificing utility.
Achieve over 2x training speedup for LLM reasoning without sacrificing accuracy by dynamically pruning Group Relative Policy Optimization (GRPO) with a novel importance sampling correction.
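For context, the group-relative advantage at the core of GRPO fits in a few lines; the paper's dynamic pruning and importance-sampling correction are not reproduced here.

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    # GRPO's group-relative advantage: sample several responses per prompt,
    # score each with a (verifiable) reward, then normalize every response's
    # reward against the mean and std of its own group.
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four sampled responses to one prompt, two judged correct.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [ 1., -1., -1.,  1.]
```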
Unlock 2x faster reinforcement learning by distilling group feedback into actionable language refinements that guide exploration.
TaxonRL doesn't just beat humans at bird identification; it shows its work, revealing a transparent reasoning process that could revolutionize how we trust AI in complex visual tasks.
Learn a critic for coding agents from human-in-the-loop interaction traces alone, sidestepping the need for dense, verifiable rewards.
LLM safety can be bypassed with optimal transport, revealing that refusal mechanisms are surprisingly localized within a few layers.
Flow matching's advantage in RL isn't distributional modeling, but rather its ability to correct value estimates iteratively and learn more adaptable features, leading to significant performance gains in challenging online settings.
Ditch hard clipping: GIPO's Gaussian-weighted importance sampling offers a smoother, more stable RL policy optimization, especially when dealing with stale or limited data.
Multilingual LLMs can be made significantly more reliable by directly optimizing for crosslingual consistency using a DPO-inspired method that requires no explicit reward model.
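For reference, the standard DPO objective such a method builds on needs no reward model at all; the crosslingual pairing rule in the usage example below is an assumption for illustration, not the paper's actual recipe.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Standard DPO loss: only per-sequence log-probs of the policy and a
    # frozen reference model are needed, no explicit reward model.
    margin = (pi_logp_chosen - ref_logp_chosen) - (pi_logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Assumed crosslingual pairing for illustration: the response that stays
# consistent across languages is 'chosen', the one that drifts is 'rejected'.
chosen, rejected = torch.tensor([-12.3]), torch.tensor([-15.1])
ref_chosen, ref_rejected = torch.tensor([-13.0]), torch.tensor([-14.2])
print(dpo_loss(chosen, rejected, ref_chosen, ref_rejected))
```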
HALyPO stabilizes human-robot collaboration by directly certifying the convergence of decentralized policy learning in parameter space, sidestepping the oscillations that plague standard MARL approaches.
LLMs struggle to maintain consistent personalization as conversations lengthen and preferences become less explicit, suggesting current models fall short of truly adaptive personal assistants.
Forget inspecting final outputs: LLMs telegraph their reward-hacking intentions internally, early in the generation process, via distinctive activation patterns.