April 20 – April 27, 2026

RLHF & Preference Learning - Weekly Roundup

68 papers published across 5 labs.

1500% acceleration

Selected Labs publishing this week

Stanford HAI (2), Tsinghua AI (1), Google Research (1), Amazon Science (1), DAMO (1)

Top Papers

Apr 27, 2026

Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

Compressing multi-dimensional human preferences into single binary labels severely degrades DPO training, but a semi-supervised approach can recover state-of-the-art performance without additional human annotation.

Xinxing Liu, Xinxin Liu, Ming Li +3

Computer Vision RLHF & Preference Learning Training Efficiency & Optimization

Apr 27, 2026·also ByteDance

ViPO: Visual Preference Optimization at Scale

Scaling visual preference optimization hinges on data quality, as a massive, high-resolution dataset renders complex optimization algorithms unnecessary.

Ming Li, Jie Wu, J. Cui +4

Computer Vision Multimodal Models RLHF & Preference Learning

Apr 23, 2026

Apr 23, 2026·also Meituan

Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

Test-time RL's vulnerability to noisy pseudo-labels is amplified by group-relative advantage estimation, but can be mitigated with a surprisingly simple debiasing and denoising approach.

Yongcan Yu, Lingxiao He, Jian Liang +5

Reasoning & Chain-of-Thought RLHF & Preference Learning

Apr 27, 2026

Joseph Lazzaro +3·Apr 27, 2026

A Finite Time Analysis of Thompson Sampling for Bayesian Optimization with Preferential Feedback

Thompson Sampling can be just as efficient with pairwise preference feedback as it is with scalar rewards, opening up new avenues for optimization in human-in-the-loop and experimental design scenarios.

Joseph Lazzaro, Davide Buffelli, Daren Shiu +1

RLHF & Preference Learning Scientific Discovery & Drug Design

Apr 27, 2026

Compute Aligned Training: Optimizing for Test Time Inference

Training LLMs to explicitly optimize for how they're *actually* used at inference time unlocks substantial performance gains compared to standard fine-tuning.

Adam Ousherovitch, Ambuj Tewari

Inference & Quantization RLHF & Preference Learning Training Efficiency & Optimization

All Papers (68)

Apr 27, 2026

Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

Xinxing Liu, Xinxin Liu, Ming Li +3

Computer Vision RLHF & Preference Learning Training Efficiency & Optimization

Apr 27, 2026·also ByteDance

ViPO: Visual Preference Optimization at Scale

Scaling visual preference optimization hinges on data quality, as a massive, high-resolution dataset renders complex optimization algorithms unnecessary.

Ming Li, Jie Wu, J. Cui +4

Computer Vision Multimodal Models RLHF & Preference Learning

Joseph Lazzaro +3·Apr 27, 2026

A Finite Time Analysis of Thompson Sampling for Bayesian Optimization with Preferential Feedback

Joseph Lazzaro, Davide Buffelli, Daren Shiu +1

RLHF & Preference Learning Scientific Discovery & Drug Design

Apr 27, 2026

Compute Aligned Training: Optimizing for Test Time Inference

Training LLMs to explicitly optimize for how they're *actually* used at inference time unlocks substantial performance gains compared to standard fine-tuning.

Adam Ousherovitch, Ambuj Tewari

Inference & Quantization RLHF & Preference Learning Training Efficiency & Optimization

Apr 27, 2026·also DFKI

Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models

RL's superior generalization isn't about brute force, but about carefully sculpting a few key features while preserving the base model's knowledge, unlike SFT's rapid specialization.

Dan Shi, S. Ostermann, Renren Jin +2

Interpretability & Mechanistic Interp Reasoning & Chain-of-Thought RLHF & Preference Learning

Stan Loosmore·Apr 27, 2026

Leverage Laws: A Per-Task Framework for Human-Agent Collaboration

Quantifying the efficiency of human-AI collaboration boils down to balancing the agent's work output against the human's time investment in task specification, interruptions, and review.

Stan Loosmore

RLHF & Preference Learning Tool Use & Agents

Xinhe Wang +2·Apr 27, 2026

Jailbreaking Frontier Foundation Models Through Intention Deception

Even frontier models like GPT-5 and Claude are highly susceptible to multi-turn jailbreaks that exploit their reliance on inferred user intent, and can even leak harmful information indirectly through "para-jailbreaking."

Xinhe Wang, Katia Sycara, Yaqi Xie

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness RLHF & Preference Learning

Apr 27, 2026

Improving Vision-language Models with Perception-centric Process Reward Models

VLMs can be taught to self-correct hallucinations at the token level, leading to substantial gains in reasoning accuracy across diverse benchmarks.

Yingqian Min, Kun Zhou, Yifan Li +6

Multimodal Models Reasoning & Chain-of-Thought RLHF & Preference Learning

Apr 25, 2026

Stanford HAI·Apr 25, 2026

V-GRPO: Online Reinforcement Learning for Denoising Generative Models Is Easier than You Think

ELBO-based reinforcement learning, previously dismissed for visual generation, can actually outperform MDP-based methods for aligning denoising generative models with human preferences.

Bingda Tang, Yuhui Zhang, Xiaohan Wang +4

RLHF & Preference Learning Training Efficiency & Optimization

Apr 23, 2026

Yilang Liu +4·Apr 23, 2026

Task-specific Subnetwork Discovery in Reinforcement Learning for Autonomous Underwater Navigation

Multi-task RL agents solving related navigation tasks underwater rely on a surprisingly small fraction of their weights (1.5%) to differentiate between tasks.

Yilang Liu, Melvin Laux, M. D. L. Álvarez +2

Interpretability & Mechanistic Interp RLHF & Preference Learning Robotics & Embodied AI

Apr 23, 2026·also Meituan

Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

Test-time RL's vulnerability to noisy pseudo-labels is amplified by group-relative advantage estimation, but can be mitigated with a surprisingly simple debiasing and denoising approach.

Yongcan Yu, Lingxiao He, Jian Liang +5

Reasoning & Chain-of-Thought RLHF & Preference Learning
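
For readers unfamiliar with the term in the summary above, group-relative advantage estimation (as in GRPO-style training) scores each sampled response against the other responses drawn for the same prompt. The following is a minimal sketch of that general idea, not the paper's method; the function name `group_relative_advantages` and the `eps` parameter are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    `eps` (an assumed stabilizer) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One response out of four matches the (possibly wrong) pseudo-label.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

Because the normalization is relative to the group, a single response that happens to match a pseudo-label receives a large positive advantage while the rest are pushed down. This may be one way a noisy label gets amplified; the paper's own analysis may differ.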

Nathanael Jo +3·Apr 23, 2026

Alignment has a Fantasia Problem

AI's assumption that users always know what they want leads to "Fantasia interactions," where systems provide superficially helpful but ultimately misaligned assistance, demanding a new approach to alignment research.

Nathanael Jo, Zoe De Simone, Mitchell Gordon +1

Constitutional AI & AI Ethics RLHF & Preference Learning Scalable Oversight & Alignment Theory

Yaxuan Li +7·Apr 23, 2026

Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training

Forget expensive real-world robot training: Hi-WM lets humans directly edit a robot's simulated reality, turning world models into powerful, reusable playgrounds for failure recovery.

Yaxuan Li, Zhongyi Zhou, Yefei Chen +5

RLHF & Preference Learning Robotics & Embodied AI World Models & Planning

Apr 22, 2026

Apr 22, 2026·also Adobe Research

ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control

Ditch the fixed trade-offs: ParetoSlider lets you smoothly navigate competing generative goals in diffusion models at inference time, without retraining.

Shelly Golan, Michael Finkelson, Ariel Bereslavsky +2

Computer Vision Multimodal Models RLHF & Preference Learning

Apr 22, 2026·also Tsinghua AI, HKUST, Huawei, Shenzhen University

GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning

LLMs can reason more effectively by directly tracking their own belief in the correct answer throughout the reasoning process, enabling more targeted policy updates.

Jingyi Wang, Lei Zhu, Tengjin Weng +8

Reasoning & Chain-of-Thought RLHF & Preference Learning

Apr 22, 2026

MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment

Geometry-aware optimization can dramatically improve LLM alignment by ensuring fairer trade-offs among conflicting human values.

Andor Vári-Kakas, Ji Won Park, Natasa Tagasovska

Constitutional AI & AI Ethics RLHF & Preference Learning Training Efficiency & Optimization

VinUniversity·Apr 22, 2026

Anchor-and-Resume Concession Under Dynamic Pricing for LLM-Augmented Freight Negotiation

A novel framework ensures that freight negotiations remain competitive and compliant with pricing dynamics, achieving high agreement rates without sacrificing decision transparency.

Hoang Nguyen, Lu Wang, Marta Gaia Bras

Natural Language Processing RLHF & Preference Learning

Apr 22, 2026·also HUST, Nankai University

R2IF: Aligning Reasoning with Decisions via Composite Rewards for Interpretable LLM Function Calling

R2IF improves function-calling accuracy by up to 34.62%, bridging the gap between reasoning and decision-making in LLMs.

A. Cheng, Kailong Wang, Yongxin Zhao

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Department of Computer Science·Apr 22, 2026

Vibrotactile Preference Learning: Uncertainty-Aware Preference Learning for Personalized Vibration Feedback

Stop guessing what feels good: this system learns personalized vibration preferences from just 40 pairwise comparisons.

Rongtao Zhang, Xin Zhu, Masoume Pourebadi Khotbehsara +3

RLHF & Preference Learning Robotics & Embodied AI

Luke Bailey +4·Apr 22, 2026

Scaling Self-Play with Self-Guidance

LLMs can guide their own self-play, leading to superhuman performance with smaller models and less compute.

Luke Bailey, Kaiyue Wen, Kefan Dong +2

RLHF & Preference Learning Scalable Oversight & Alignment Theory Training Efficiency & Optimization

Darsh Kachroo +4·Apr 22, 2026

HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs

LLMs can learn to reason more effectively by breaking down the reasoning process and optimizing each step individually.

Darsh Kachroo, Adriana Caraeni, Arjun Prasaath Anbazhagan +2

Reasoning & Chain-of-Thought RLHF & Preference Learning

MIRAI·Apr 22, 2026

Self-Guided Plan Extraction for Instruction-Following Tasks with Goal-Conditional Reinforcement Learning

Forget meticulously annotating subtasks – SuperIgor lets language models self-learn to generate and refine instruction-following plans through RL feedback.

Zoya Volovikova, Nikita Sorokin, Dmitriy Lukashevskiy +2

RLHF & Preference Learning Tool Use & Agents World Models & Planning

Zhaofeng Wu +6·Apr 22, 2026·also HKU

Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL

Naively applying RL to code generation models can *hurt* cross-language transfer, but a clever pre-training trick using "parallel programs" unlocks better generalization.

Zhaofeng Wu, Shiqi Wang, Boya Peng +4

Code Generation & Program Synthesis RLHF & Preference Learning Training Efficiency & Optimization

Apr 22, 2026·also Google Research, VIA Research Center

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

Ditch the language priors: SSL-R1 unlocks verifiable rewards for MLLM reinforcement learning directly from images, using self-supervision to solve visual puzzles.

Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr +2

Computer Vision Multimodal Models RLHF & Preference Learning

Thrust of Artificial Intelligence·Apr 22, 2026·also NJU, OPPO

Discrete Preference Learning for Personalized Multimodal Generation

Quantizing user preferences into discrete tokens unlocks personalized multimodal content generation with improved consistency between modalities.

Yuting Zhang, Ying Sun, Dazhong Shen +3

Multimodal Models Recommendation & Information Retrieval RLHF & Preference Learning

Apr 22, 2026·also BAAI

Near-Future Policy Optimization

Forget external teachers – the best way to boost your RL model's performance is to learn from its future self.

Chuanyu Qin, Chen Yang, Chenxu Yang +9

RLHF & Preference Learning Training Efficiency & Optimization

Juyong Jiang +6·Apr 22, 2026

WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning

A 7B parameter model, guided by a novel RL framework, can now generate multi-page websites that rival the functionality of a 671B parameter model, while surpassing it in visual appeal.

Juyong Jiang, Chenglin Cai, Chansung Park +4

Code Generation & Program Synthesis RLHF & Preference Learning Tool Use & Agents

Apr 21, 2026

University of Catania·Apr 21, 2026·also Polish Academy of Sciences, Poznan University of Technology

PREF-XAI: Preference-Based Personalized Rule Explanations of Black-Box Machine Learning Models

Stop guessing what explanations users want: PREF-XAI learns personalized explanations by directly modeling user preferences over rule-based explanations.

Salvatore Greco, Jacek Karolczak, Roman Słowiński +1

Interpretability & Mechanistic Interp RLHF & Preference Learning

Apr 21, 2026·also Fudan, Shanghai AI Lab, Shanghai Qiji Zhifeng Co.

EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

Learned critics in RLHF can actually *increase* variance and hurt performance in sparse-reward settings, but a simple explained variance metric can tell you when to ditch the critic and get better results.

Chengjun Pan, Shichun Liu, Jiahang Lin +8

RLHF & Preference Learning Training Efficiency & Optimization
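
As background for the summary above, the explained variance of a critic is commonly defined as 1 − Var(returns − values) / Var(returns). The sketch below computes that standard quantity. Whether EVPO uses exactly this form, and how it acts on the result, is not stated in this listing, so treat the function name `explained_variance` and the handling of constant returns as assumptions.

```python
from statistics import pvariance


def explained_variance(values, returns):
    """Fraction of the variance in `returns` accounted for by critic `values`.

    1.0 means perfect prediction, 0.0 is no better than a constant,
    and negative values mean the critic is worse than a constant.
    Returns NaN when the returns have zero variance (an assumed convention).
    """
    var_returns = pvariance(returns)
    if var_returns == 0:
        return float("nan")
    residuals = [g - v for g, v in zip(returns, values)]
    return 1.0 - pvariance(residuals) / var_returns


# A critic whose predictions run opposite to the observed returns.
score = explained_variance([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

A negative score like this one is the kind of signal that could indicate a critic is adding variance and not reducing it.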

Manav Pandey·Apr 21, 2026

LLMs Know They're Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit

LLMs aren't just wrong sometimes, they *know* they're wrong and agree with you anyway, thanks to a surprisingly compact "sycophancy-lying circuit" that evades current alignment techniques.

Manav Pandey

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp RLHF & Preference Learning

Qiang Liu +2·Apr 21, 2026

Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback

Forget reward model fitting: these primal-dual policy gradient methods offer provably safe and convergent RLHF in infinite horizon settings.

Qiang Liu, Adrienne Kline, Ermin Wei

Constitutional AI & AI Ethics RLHF & Preference Learning

Linwei Dong +5·Apr 21, 2026·also ZJU

Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

Forget noisy samples, RL can now directly optimize the *gradients* of diffusion distillation, leading to SOTA few-step image generation.

Linwei Dong, Ruoyu Guo, Ge Bai +3

Inference & Quantization RLHF & Preference Learning Training Efficiency & Optimization

Zhuang Yuan +9·Apr 21, 2026

Low-Rank Adaptation for Critic Learning in Off-Policy Reinforcement Learning

Freezing most of your critic network and only training a tiny LoRA adapter can dramatically improve off-policy RL performance and stability.

Zhuang Yuan, Yuexin Bian, Sihong He +7

Architecture Design (Transformers, SSMs, MoE) RLHF & Preference Learning Training Efficiency & Optimization

Cristina Garbacea +4·Apr 21, 2026

Personalized Benchmarking: Evaluating LLMs by Individual Preferences

Aggregate LLM benchmarks mislead on individual preferences: model rankings correlate near-zero for over half of users.

Cristina Garbacea, Heran Wang +2

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks RLHF & Preference Learning

Apr 21, 2026

DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

Noisy multimodal preference datasets are holding back reward model performance, but DT2IT-MRM offers a scalable curation strategy that achieves state-of-the-art results.

Zhihong Zhang, Jie Zhao, Xiaojian Huang +3

Data Curation & Synthetic Data Multimodal Models RLHF & Preference Learning

Shuai Wu +4·Apr 21, 2026

The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models

LLMs are drowning in verbal tics—sycophantic openers and pseudo-empathetic affirmations—and this "alignment tax" significantly reduces perceived naturalness.

Shuai Wu, Yanna Feng, Yufang Li +2

Constitutional AI & AI Ethics Natural Language Processing RLHF & Preference Learning

Apr 21, 2026·also SJTU, TeleAI

Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation

Stop blasting your diffusion models with a single, static reward signal: fine-grained credit assignment across denoising steps and objectives unlocks better image and video generation.

Rui Li, Kechun Hao, Yuanzhi Liang +3

Computer Vision Multimodal Models RLHF & Preference Learning

Apr 21, 2026·also Nankai University

HP-Edit: A Human-Preference Post-Training Framework for Image Editing

Forget expensive human feedback loops: a VLM-powered reward function can efficiently align image editing diffusion models with human preferences.

Fan Li, Chong Wang, Chonghuinan Wang +9

Computer Vision Multimodal Models RLHF & Preference Learning

Apr 21, 2026·also CUHK

Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

Forget expensive human annotation: this self-play method lets LLMs bootstrap their own training signals for open-ended tasks by generating rubrics to evaluate their own outputs.

Chengyu Huang, Sheng-Yen Chou, Zhengxin Zhang +1

Natural Language Processing RLHF & Preference Learning Training Efficiency & Optimization

Yefim Shulman +2·Apr 21, 2026

Discerning Authorship in Online Health Communities: Experience, Trust, and Transparency Implications for Moderating AI

People can't tell the difference between AI-generated and human-written health advice online, raising serious trust and transparency concerns for online health communities.

Yefim Shulman, Agnieszka Kitkowska, Mark Warner

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Natural Language Processing +1

Yulai Zhang +3·Apr 21, 2026

Reinforcement Learning Enabled Adaptive Multi-Task Control for Bipedal Soccer Robots

Bipedal soccer robots can now autonomously recover from falls in under a second thanks to a novel RL framework.

Yulai Zhang, Yinrong Zhang, Ting Wu +1

RLHF & Preference Learning Robotics & Embodied AI

Carter Adams +3·Apr 21, 2026

Rethinking Reinforcement Fine-Tuning in LVLM: Convergence, Reward Decomposition, and Generalization

Decomposing complex, verifiable rewards in LVLM reinforcement fine-tuning provably accelerates convergence and improves generalization, offering a principled alternative to monolithic reward optimization.

Carter Adams, Rafael Oliveira, Gabriel Almeida +1

Multimodal Models RLHF & Preference Learning Tool Use & Agents

Apr 21, 2026

SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution

Social intelligence may require more than just reasoning power: a 7B model trained with SAVOIR beats GPT-4o and Claude-3.5-Sonnet on social interaction tasks.

Xiachong Feng, Yilei Jiang, Xiaocheng Feng +9

Natural Language Processing RLHF & Preference Learning Tool Use & Agents

Apr 20, 2026

Stanford HAI·Apr 20, 2026·also Google Research

FUSE: Ensembling Verifiers with Zero Labeled Data

FUSE achieves verification quality on par with semi-supervised methods, all without needing any labeled data.

Joonhyuk Lee, Virginia Ma, Sarah Zhao +4

Eval Frameworks & Benchmarks RLHF & Preference Learning

Yubing Wu +5·Apr 20, 2026

Towards Disentangled Preference Optimization Dynamics Beyond Likelihood Displacement

Preference optimization objectives, despite their diversity, can be steered towards disentangled dynamics that avoid suppressing the chosen response alongside the rejected one, simply by satisfying a "disentanglement band" condition.

Yubing Wu, Junmei Yang, Delu Zeng +3

RLHF & Preference Learning Training Efficiency & Optimization

Salman Rahman +5·Apr 20, 2026

When Can LLMs Learn to Reason with Weak Supervision?

Generalization in LLMs hinges on training reward saturation dynamics, with reasoning faithfulness emerging as a critical predictor of success under weak supervision.

Salman Rahman, Jingyan Shen, Anna Mordvina +3

Data Curation & Synthetic Data Reasoning & Chain-of-Thought RLHF & Preference Learning

Apr 20, 2026·also Notre Dame

Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

RL fine-tuning can *hurt* reasoning performance when your base LLM is already too good, unless you force it to explore more diverse solutions.

Zhenwen Liang, Yujun Zhou, Sidi Lu +1

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought RLHF & Preference Learning

Apr 20, 2026·also Hithink RoyalFlush Information Network, UMacau

OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

LLMs can learn to explore beyond their initial latent space and achieve substantial gains in mathematical reasoning by unifying offline teacher guidance and online reinforcement learning with a specialized reward modeling lens.

Xinyu Ma, Mingzhou Xu, Xuebo Liu +3

Reasoning & Chain-of-Thought RLHF & Preference Learning Tool Use & Agents

Xingyu Fan +2·Apr 20, 2026

PARM: Pipeline-Adapted Reward Model

Reward models optimized for single-step generation can fail spectacularly when integrated into multi-stage LLM pipelines, but pipeline-aware training can fix this.

Xingyu Fan, Linqi Song, Pheng Ann Heng

Code Generation & Program Synthesis RLHF & Preference Learning

Jiayi Wu +4·Apr 20, 2026

Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search

GRPO's Achilles' heel in deep search is its coarse advantage assignment, but CalibAdv offers a way to surgically correct it, boosting both performance and training stability.

Jiayi Wu, Zeqian Huang, Lei Jiang +2

Recommendation & Information Retrieval RLHF & Preference Learning

Apr 20, 2026

StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

Shifting from token-level to step-level optimization could redefine how we train LLMs for complex, multi-turn interactions.

Daoyu Wang, Qingchuan Li, Jie Ouyang

RLHF & Preference Learning Tool Use & Agents Training Efficiency & Optimization

Apr 20, 2026

PRISMA: Preference-Reinforced Self-Training Approach for Interpretable Emotionally Intelligent Negotiation Dialogues

Achieve more human-like negotiation from dialogue agents by explicitly modeling and reasoning about emotions with interpretable chain-of-thought prompting.

Prajwal Vijay Kajare, Priyanshu Priya, Bikash Santra +1

Interpretability & Mechanistic Interp Natural Language Processing RLHF & Preference Learning

Apr 20, 2026·also IIT Madras, IIT Roorkee, UMD

Decisive: Guiding User Decisions with Optimal Preference Elicitation from Unstructured Documents

Decisive's preference elicitation from unstructured documents helps users make decisions with 20% higher accuracy.

Akriti Jain, Anish Mulay, Divyansh Verma +1

Natural Language Processing Recommendation & Information Retrieval RLHF & Preference Learning

Apr 20, 2026·also NTU

SignDPO: Multi-level Direct Preference Optimisation for Skeleton-based Gloss-free Sign Language Translation

Multi-level preference alignment in SignDPO significantly reduces semantic drift, outperforming traditional gloss-free models and challenging gloss-based benchmarks.

Muxin Pu, Xiao-Ming Wu, Mei Kuan Lim +1

Multimodal Models RLHF & Preference Learning

Apr 20, 2026

Process Reward Models Meet Planning: Generating Precise and Scalable Datasets for Step-Level Rewards

Forget expensive, error-prone math problems: PDDL planning offers a surprisingly effective and scalable route to training better Process Reward Models for LLM reasoning.

Raffaele Pisano, Roberto Navigli

Reasoning & Chain-of-Thought RLHF & Preference Learning World Models & Planning

Shangyu Li +9·Apr 20, 2026

CodePivot: Bootstrapping Multilingual Transpilation in LLMs via Reinforcement Learning without Parallel Corpora

Forget parallel corpora: CodePivot shows you can train a 7B model to beat behemoth LLMs at multilingual code transpilation by pivoting through Python and using a clever RL reward.

Shangyu Li, Juyong Jiang, Meibo Ren +7

Code Generation & Program Synthesis RLHF & Preference Learning Training Efficiency & Optimization

School of Computer Science·Apr 20, 2026·also LLM Department, Nankai University, National Key Laboratory for Multimedia, PKU +1

Tool Learning Needs Nothing More Than a Free 8B Language Model

Training tool-calling agents with just an 8B language model outperforms traditional methods that depend on expensive resources, reshaping the landscape of tool learning.

Chenming Tang, Hsiu-Yuan Huang, Weijie Liu +3

RLHF & Preference Learning Tool Use & Agents World Models & Planning

Apr 20, 2026

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

R-CAI can generate high-quality toxic data while improving semantic coherence, revolutionizing how we approach adversarial data synthesis for AI safety.

Yuan Fang, Yiming Luo, Aimin Zhou +1

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness RLHF & Preference Learning

Apr 20, 2026·also Peng Cheng Laboratory

Mitigating Multimodal Hallucination via Phase-wise Self-reward

LVLMs hallucinate in predictable bursts, and this self-rewarding decoding strategy slashes those errors in half.

Yu Zhang, Chuyang Sun, Kehai Chen +3

Eval Frameworks & Benchmarks Multimodal Models RLHF & Preference Learning

Apr 20, 2026

Latent Preference Modeling for Cross-Session Personalized Tool Calling

Forget full-history prompting: this work shows you can slash token costs by 98% while boosting tool-calling accuracy by explicitly modeling and refining latent user preferences.

Yejin Yoon, Minseon Kim, Taeuk Kim

Eval Frameworks & Benchmarks Recommendation & Information Retrieval RLHF & Preference Learning +1

Mengzhao Jia +2·Apr 20, 2026

Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness

Rewarding *correct* answers in multimodal reasoning can actually *worsen* reasoning quality, but a simple groupwise ranking of solution trajectories significantly boosts reliability.

Mengzhao Jia, Meng Jiang

Multimodal Models Reasoning & Chain-of-Thought RLHF & Preference Learning

Chris Li +3·Apr 20, 2026

Human-Guided Harm Recovery for Computer Use Agents

Instead of just preventing harmful actions by LM agents, we can now steer them back from the brink using human-aligned recovery plans, significantly improving safety after a mistake.

Chris Li, Sky Ch-Wang, Andi Peng +1

Constitutional AI & AI Ethics RLHF & Preference Learning Tool Use & Agents

Amazon Science·Apr 20, 2026

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

Current red-teaming efforts miss the forest for the trees: ARES reveals that safety failures often stem from a systemic breakdown between the LLM *and* the reward model, not just the LLM itself.

Jiacheng Liang, Yao Ma, Tharindu Kumarage +8

Red-Teaming & Adversarial Robustness RLHF & Preference Learning

Md Rysul Kabir +3·Apr 20, 2026

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

Jailbreaking LLMs isn't a monolith: seemingly equivalent levels of harmful compliance can mask drastically different internal mechanisms and vulnerabilities, with RLVR surprisingly preserving much of the original model's safety awareness.

Md Rysul Kabir, Zoran Tiganj +1

Constitutional AI & AI Ethics Red-Teaming & Adversarial Robustness RLHF & Preference Learning

Xiang He +7·Apr 20, 2026

Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

Forget supervised fine-tuning: RL alone can unlock high-quality chain-of-thought reasoning in audio-language models, even starting from a model with no prior CoT capability.

Xiang He, Chenxing Li, Jinting Wang +5

Reasoning & Chain-of-Thought RLHF & Preference Learning Speech & Audio

Apr 20, 2026·also DAMO, Tsinghua AI, BUPT

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

RL fine-tuning of discrete diffusion models can be made dramatically more stable and effective by treating the final denoised sample as the action and reconstructing trajectories using the forward diffusion process.

Jiaqi Wang, Haoge Deng, Ting Pan +10

Architecture Design (Transformers, SSMs, MoE) Computer Vision RLHF & Preference Learning +1

Qingcheng Zeng +4·Apr 20, 2026·also UC Santa Barbara

Dual-View Training for Instruction-Following Information Retrieval

Flipping relevance labels via LLM-generated complementary instructions boosts instruction-following retrieval by 45%, proving that targeted data synthesis beats brute-force scaling.

Qingcheng Zeng, Puxuan Yu, Aman Mehta +2

Data Curation & Synthetic Data Natural Language Processing Recommendation & Information Retrieval +1

Apr 20, 2026

Align Generative Artificial Intelligence with Human Preferences: A Novel Large Language Model Fine-Tuning Method for Online Review Management

Forget generic chatbots – this fine-tuning method lets LLMs craft review responses that are not only more accurate but also better aligned with human preferences, all while avoiding the dreaded over-cautious tone.

Yanan Wang, Yong Ge

Natural Language Processing Recommendation & Information Retrieval RLHF & Preference Learning