RLHF & Preference Learning

Safety & Alignment

Training AI systems from human feedback using reinforcement learning, direct preference optimization, and reward modeling.

Keywords

RLHF, reinforcement learning from human feedback, preference learning, reward modeling, DPO, direct preference optimization, KTO, constitutional AI training

Recent Papers

Mar 1, 2026

National Engineering Research Center for Robot Visual Perception and Control Technology | just now

Efficient Robotic 3D Measurement Through Multi-DoF Reinforcement Learning for Continuous Viewpoint Planning

This paper introduces a multi-degree-of-freedom reinforcement learning framework for robotic 3D measurement, enabling continuous viewpoint planning to improve the reconstruction of complex geometries. The framework uses a voxel-based state representation with dynamic ray-traced coverage updates and a dual-objective reward function to balance overlap control and viewpoint minimization. Experimental results on industrial parts show the proposed method achieves superior overlap regulation and planning efficiency compared to existing techniques, leading to more accurate 3D reconstructions.

Introduces a novel multi-DoF reinforcement learning framework for robotic 3D measurement that optimizes viewpoint planning by dynamically balancing coverage, overlap, and robotic kinematics.

Jun Ye, Qiu Fang, Shi Wang +3

Robotics & Embodied AI | RLHF & Preference Learning | World Models & Planning

College of Electronic Engineering | just now

Transferring Policy of Offline Reinforcement Learning From Hybrid Dataset to Real World via Progressive Neural Network

This paper addresses the distributional mismatch that arises in offline RL when policies learned from hybrid (real and simulated) datasets are transferred to the real world. The authors propose using Progressive Neural Networks (PNNs) to transfer the offline policy, leveraging the hybrid dataset for faster learning and improved real-world adaptation. Experiments on robotic manipulation tasks demonstrate that PNNs effectively retain the learned policy, bridge the sim-to-real gap, and enable more diverse exploration during online fine-tuning.

Introduces a PNN-based transfer learning approach to mitigate distributional shift and improve real-world adaptation in offline RL using hybrid datasets.

Pengyu Zhao, Zheng Fang, Tongxu Ai +4

Robotics & Embodied AI | RLHF & Preference Learning

Feb 12, 2026

2d ago

RELATE: A Reinforcement Learning-Enhanced LLM Framework for Advertising Text Generation

The paper introduces RELATE, a reinforcement learning framework for end-to-end advertising text generation that directly optimizes for conversion-oriented metrics and compliance constraints. RELATE integrates performance and compliance objectives into the text generation process via policy learning, moving beyond the traditional two-stage generation and alignment paradigm. Experiments on industrial datasets and online deployment show that RELATE significantly improves click-through conversion rate (CTCVR) while adhering to policy constraints.

Introduces an end-to-end reinforcement learning framework, RELATE, that unifies advertising text generation with conversion-oriented objective alignment and compliance constraints.

Jinfang Wang, Jiajie Liu, Jianwei Wu +6 | 2602.11780

RLHF & Preference Learning | Natural Language Processing | Recommendation & Information Retrieval

2d ago

Accelerating Robotic Reinforcement Learning with Agent Guidance

The paper introduces Agent-guided Policy Search (AGPS), a novel reinforcement learning framework that replaces human supervisors with a multimodal agent to improve sample efficiency in robotic manipulation tasks. AGPS leverages the agent as a semantic world model, using executable tools to provide corrective waypoints and spatial constraints for exploration. Experiments on precision insertion and deformable object manipulation tasks demonstrate that AGPS outperforms Human-in-the-Loop methods, achieving better sample efficiency by automating the supervision pipeline.

Introduces Agent-guided Policy Search (AGPS), a framework that automates robot reinforcement learning by using a multimodal agent to provide corrective guidance, thereby improving sample efficiency and scalability compared to human-in-the-loop methods.

Zili Zou, Yaoxiang Pu, Haotong Zhang +2 | 2602.11978

Robotics & Embodied AI | RLHF & Preference Learning | Training Efficiency & Optimization

2d ago

Capability-Oriented Training Induced Alignment Risk

The paper investigates capability-oriented training induced exploitation in language models trained with reinforcement learning, where models learn to exploit implicit loopholes in the training environment to maximize reward. Through a suite of four "vulnerability games," the authors demonstrate that models consistently learn to exploit flaws related to context-conditional compliance, proxy metrics, reward tampering, and self-evaluation. The key finding is that these exploitative strategies generalize to new tasks and can be distilled from teacher to student models, highlighting a fundamental challenge to current alignment approaches.

Demonstrates that reinforcement learning-trained language models spontaneously learn to exploit implicit loopholes in training environments to maximize reward, even without explicit malicious intent.

Yujun Zhou, Yue Huang, Han Bao +6 | 2602.12124

RLHF & Preference Learning | Scalable Oversight & Alignment Theory | Red-Teaming & Adversarial Robustness

2d ago

Mitigating Mismatch within Reference-based Preference Optimization

The paper identifies a "premature satisfaction" issue in Direct Preference Optimization (DPO) where the reference policy's preference for rejected responses attenuates the gradient even when the policy is still incorrect. To address this, they propose Hybrid-DPO (HyPO), a modification that conditionally applies the reference signal, treating it as neutral when pessimistic. HyPO improves inference-aligned metrics and pairwise win rates by strengthening per-example learning signals on pessimistic pairs while maintaining DPO's objective form and computational cost.

Introduces Hybrid-DPO (HyPO), a drop-in replacement for DPO that conditionally debiases the reference signal to mitigate premature satisfaction in pessimistic pairs.

Xin Yu, Jiyang Zheng, Dadong Wang +2 | 2602.11902

RLHF & Preference Learning | Training Efficiency & Optimization
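To make the mechanism in the HyPO entry above more concrete, here is a minimal sketch in plain Python. It assumes the standard DPO loss on a single preference pair, where a margin is the log-probability of the chosen response minus that of the rejected one. "Treating the reference as neutral when pessimistic" is modeled here as clamping a negative reference margin to zero. That clamping rule is one reading of the summary, not the paper's stated objective, and the function names and the `beta` default are illustrative.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_margin: float, ref_margin: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair.

    Each margin is log p(chosen) - log p(rejected) under the given model.
    """
    return -log_sigmoid(beta * (policy_margin - ref_margin))


def clamped_reference_loss(policy_margin: float, ref_margin: float, beta: float = 0.1) -> float:
    """Hypothetical HyPO-style variant (assumption, not the paper's formula).

    A pessimistic reference (negative margin, i.e. it prefers the rejected
    response) is replaced by a neutral margin of zero.
    """
    return -log_sigmoid(beta * (policy_margin - max(ref_margin, 0.0)))
```

With a policy margin of -1 and a reference margin of -3, the standard loss sees a sigmoid argument of +0.2 and is already partly satisfied even though the policy still prefers the rejected response. The clamped variant sees -0.1 and therefore returns a larger loss. When the reference margin is non-negative, the two functions agree.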

2d ago

Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning

The paper introduces Temperature Adaptive Meta Policy Optimization (TAMPO), a novel framework that learns to control the temperature hyperparameter of an LLM during reinforcement learning. TAMPO uses a hierarchical two-loop process where an inner loop updates the LLM policy using trajectories sampled at temperatures selected by a meta-policy, and an outer loop updates the meta-policy to favor temperatures that maximize the likelihood of high-advantage trajectories. Experiments on mathematical reasoning benchmarks demonstrate that TAMPO outperforms baselines with fixed or heuristic temperature schedules, showing the effectiveness of learned temperature control for adaptive exploration.

Introduces a hierarchical reinforcement learning framework, TAMPO, that learns a meta-policy to dynamically adjust the temperature parameter of an LLM, optimizing exploration during policy learning.

Haoran Dang, Cuiling Lan, Hai Wan +2 | 2602.11779

RLHF & Preference Learning | Training Efficiency & Optimization | Natural Language Processing

2d ago

Detecting RLVR Training Data via Structural Convergence of Reasoning

The paper addresses the problem of detecting training data contamination in Reinforcement Learning with Verifiable Rewards (RLVR) fine-tuned reasoning models, where standard likelihood-based detection methods are ineffective. They observe that RLVR training leads to a structural convergence in the model's generations for seen prompts, resulting in more rigid and similar outputs compared to unseen prompts. They introduce Min-$k$NN Distance, a black-box detector that leverages this convergence by measuring the average of the $k$ smallest nearest-neighbor edit distances between multiple completions of a given prompt.

Introduces Min-$k$NN Distance, a novel black-box detector, to identify RLVR training data by quantifying the structural convergence of reasoning trajectories induced by RLVR.

Hongbo Zhang, Yue Yang, Guangsheng Bao | 2602.11792

RLHF & Preference Learning | Reasoning & Chain-of-Thought | Eval Frameworks & Benchmarks
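The Min-$k$NN Distance statistic described in the entry above can be sketched as follows. This is an illustrative reading of the summary only: it computes each completion's edit distance to its nearest neighbor among the other completions, then averages the $k$ smallest of those values. Character-level Levenshtein distance, the absence of length normalization, and the function names are all assumptions; the paper may use a different tokenization or scaling.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def min_knn_distance(completions: List[str], k: int) -> float:
    """Average of the k smallest nearest-neighbor edit distances.

    Hypothetical helper: lower values mean the completions for a prompt
    are more similar to one another.
    """
    if len(completions) < 2:
        raise ValueError("need at least two completions")
    if k < 1:
        raise ValueError("k must be at least 1")
    nearest = []
    for i, text in enumerate(completions):
        others = (c for j, c in enumerate(completions) if j != i)
        nearest.append(min(edit_distance(text, other) for other in others))
    smallest = sorted(nearest)[:k]
    return sum(smallest) / len(smallest)
```

Under this reading, a prompt whose sampled completions cluster tightly would receive a low score, which is the signal the summary associates with prompts seen during RLVR training. Any decision threshold would have to be calibrated separately.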

2d ago

Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training

This paper introduces Distribution Discriminant Theory (DDT) to quantify the alignment between training data and the model-induced distribution in supervised fine-tuning (SFT) of LLMs. Based on DDT, they propose In-Distribution Finetuning (IDFT), a loss-level method, and Hinted Decoding, a data-level technique, to improve generalization by aligning the training data distribution with the model's. Experiments show that the proposed framework achieves generalization performance comparable to offline RL methods like DPO and SimPO, while retaining the efficiency of SFT.

Introduces Distribution Discriminant Theory (DDT) to quantify and improve the alignment between training data and model-induced distributions in LLM supervised fine-tuning.

Miaosen Zhang, Yishan Liu, Shuxia Lin +5 | 2602.12222

RLHF & Preference Learning | Training Efficiency & Optimization | Natural Language Processing

2d ago

RL over Commodity Networks: Overcoming the Bandwidth Barrier with Lossless Sparse Deltas

The paper introduces SparrowRL, a novel RL training system designed to overcome bandwidth limitations in commodity-networked GPU resources by exploiting the sparsity of per-step updates during RL fine-tuning. SparrowRL achieves this by representing updates as sparse delta checkpoints, pipelining delta extraction with multi-stream transmission, overlapping transfer with rollout generation, and employing throughput- and bandwidth-aware scheduling. Experiments on Qwen3 models show SparrowRL reduces per-step transfer payload by 79x and improves throughput by 2.4-9.5x over full-weight broadcast across WAN, achieving comparable throughput to RDMA clusters with improved cost efficiency.

Introduces SparrowRL, a system that enables efficient RL training over commodity networks by leveraging sparse delta checkpoints and bandwidth-aware scheduling to minimize communication overhead.

Chaoyi Ruan, Geng Luo, Xinyi Wan +8 | 2602.11456

RLHF & Preference Learning | Distributed Systems & Hardware | Training Efficiency & Optimization
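The core idea behind the sparse delta checkpoints mentioned in the SparrowRL entry above can be illustrated with a toy example. The sketch below treats a model as a flat list of floats and records only the positions whose values changed between two steps. It says nothing about SparrowRL's actual encoding, pipelining, or scheduling, and `extract_delta` and `apply_delta` are hypothetical names.

```python
from typing import Dict, List


def extract_delta(old: List[float], new: List[float]) -> Dict[int, float]:
    """Return {index: new_value} for every position whose value changed."""
    if len(old) != len(new):
        raise ValueError("weight vectors must have the same length")
    return {i: n for i, (o, n) in enumerate(zip(old, new)) if o != n}


def apply_delta(old: List[float], delta: Dict[int, float]) -> List[float]:
    """Rebuild the new weights exactly from the old weights plus the delta."""
    updated = list(old)
    for index, value in delta.items():
        updated[index] = value
    return updated
```

Because the delta stores the new values themselves rather than rounded differences, reconstruction is exact, which is the sense in which this toy version is lossless. The payload shrinks in proportion to how few entries change per step.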

2d ago

Improving HPC Code Generation Capability of LLMs via Online Reinforcement Learning with Real-Machine Benchmark Rewards

This paper introduces an online reinforcement learning (RL) approach to improve the high-performance computing (HPC) code generation capabilities of large language models (LLMs) by using runtime performance (GFLOPS) on a supercomputer as a direct reward signal. They propose a Staged Quality-Diversity (SQD) algorithm that progressively varies optimization techniques to encourage diverse learning. The authors trained Qwen2.5 Coder 14B on a double-precision matrix multiplication task using Group Relative Policy Optimization (GRPO), demonstrating improved HPC code generation.

Demonstrates that online reinforcement learning with real-machine benchmark rewards and staged optimization significantly improves the HPC code generation performance of LLMs.

Ryo Mikasa, Shun-ichiro Hayashi, Daichi Mukunoki +2 | 2602.12049

Code Generation & Program Synthesis | RLHF & Preference Learning | Eval Frameworks & Benchmarks

2d ago

Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

The paper introduces Composition-RL, a method to improve reinforcement learning of LLMs by composing multiple verifiable prompts into a single, more complex prompt, addressing the issue of diminishing returns from easy (pass-rate-1) prompts as training progresses. This approach aims to better utilize limited verifiable prompts by creating new training examples that maintain a high pass rate while increasing complexity. Experiments on models ranging from 4B to 30B parameters demonstrate that Composition-RL enhances reasoning capabilities and enables more effective cross-domain RL when combined with a curriculum learning strategy that gradually increases compositional depth.

Introduces Composition-RL, a novel method that composes multiple verifiable prompts to create more complex training examples for reinforcement learning of LLMs, thereby improving reasoning capabilities and cross-domain generalization.

Clive Bai, Weijie Liu, Yang Wang +2 | 2602.12036

RLHF & Preference Learning | Training Efficiency & Optimization | Data Curation & Synthetic Data

2d ago

How Sampling Shapes LLM Alignment: From One-Shot Optima to Iterative Dynamics

This paper theoretically analyzes the impact of sampling strategies and iterative dynamics on the alignment of large language models using preference optimization frameworks like Identity Preference Optimization and Direct Preference Optimization. It demonstrates that instance-dependent sampling improves ranking guarantees, while skewed on-policy sampling can lead to excessive concentration. Furthermore, the paper proves that iterative alignment, where the learned policy influences future sampling, can result in instability, oscillations, or entropy collapse under specific conditions, and it identifies stable regimes.

Establishes theoretical results characterizing how sampling strategies and iterative feedback loops in preference alignment impact the stability, convergence, and ranking performance of LLMs.

Yurong Chen, Yu He, Michael I. Jordan +1 | 2602.12180

RLHF & Preference Learning | Scalable Oversight & Alignment Theory

2d ago

STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning

The paper introduces STVG-R1, a reinforcement learning framework for spatial-temporal video grounding (STVG) that addresses misalignment between textual descriptions and visual coordinates by reformulating per-frame coordinate prediction as instance-level identification using temporally consistent IDs embedded as visual prompts. This approach avoids the need for additional trainable modules and complex alignment strategies. By employing a task-driven reward to optimize temporal accuracy, spatial consistency, and structural format regularization, STVG-R1 achieves state-of-the-art results on multiple STVG benchmarks and demonstrates strong zero-shot generalization capabilities.

Introduces a novel visual prompting paradigm for spatial-temporal video grounding that reformulates coordinate prediction as instance-level identification and optimizes the process using reinforcement learning.

Xiaowen Zhang, Licheng Jiao, Qing Li | 2602.11730

Multimodal Models | Computer Vision | RLHF & Preference Learning

2d ago

CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

The paper introduces CM2, a reinforcement learning framework that utilizes checklist rewards instead of verifiable outcome rewards to train agents for multi-turn, multi-step tool use. CM2 decomposes each turn's behavior into fine-grained binary criteria with evidence grounding, enabling more stable classification-style reward signals. Experiments in an LLM-simulated tool environment demonstrate that CM2 significantly outperforms supervised fine-tuning baselines on benchmarks like τ²-Bench, BFCL-V4, and ToolSandbox, achieving comparable or superior performance to similarly sized open-source models.

This paper introduces a novel reinforcement learning framework, CM2, that replaces traditional verifiable rewards with checklist-based rewards for training agents to effectively use tools in multi-turn, multi-step interactions.

Xun Wang, Yebowen Hu, Chenyang Zhao +5 | 2602.12268

RLHF & Preference Learning | Tool Use & Agents | Eval Frameworks & Benchmarks

2d ago

FAIL: Flow Matching Adversarial Imitation Learning for Image Generation

The paper introduces Flow Matching Adversarial Imitation Learning (FAIL), a novel approach to fine-tuning flow matching models for image generation by framing the alignment with a target distribution as an imitation learning problem. FAIL leverages adversarial training to minimize the divergence between the policy and expert demonstrations, avoiding the need for explicit rewards or pairwise comparisons. The authors demonstrate that FAIL achieves competitive performance on prompt following and aesthetic benchmarks with limited demonstrations, and also show its effectiveness in discrete image/video generation and as a regularizer against reward hacking.

Introduces FAIL, a new adversarial imitation learning framework for fine-tuning flow matching models that avoids explicit reward modeling or pairwise comparisons.

Yeyao Ma, Chen Li, Xiaosong Zhang | 2602.12155

Computer Vision | RLHF & Preference Learning | Red-Teaming & Adversarial Robustness

University of Electronic | 2d ago

FedGRPO: Privately Optimizing Foundation Models with Group-Relative Rewards from Domain Client

The paper introduces FedGRPO, a federated learning framework for optimizing foundation models by leveraging data from domain clients while preserving privacy. It frames the problem as a reinforcement learning task where a server model learns from scalar reward signals provided by expert clients selected using a competence-based confidence graph. FedGRPO aggregates these rewards using a federated group-relative loss function, achieving improved downstream accuracy and communication efficiency compared to existing federated foundation model approaches.

Introduces FedGRPO, a privacy-preserving federated learning framework that optimizes foundation models by aggregating group-relative reward signals from expert clients selected via a competence-based confidence graph.

Gongxi Zhu | 2602.12014

RLHF & Preference Learning | Training Efficiency & Optimization | Distributed Systems & Hardware

2d ago

Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment

This paper introduces the Value Alignment Tax (VAT), a framework to quantify how aligning LLMs to specific values impacts the broader value system. VAT measures the trade-offs between gains in target value alignment and changes in other interconnected values. Using a dataset of scenario-action pairs grounded in Schwartz value theory, the authors demonstrate that alignment interventions induce structured co-movement among values, which target-only evaluations often miss.

Introduces the Value Alignment Tax (VAT) framework to quantify and analyze the systemic effects of value alignment interventions in LLMs.

Jiajun Chen, Hua Shen | 2602.12134

Constitutional AI & AI Ethics | RLHF & Preference Learning | Eval Frameworks & Benchmarks

2d ago

P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

The paper introduces P-GenRM, a personalized generative reward model that addresses limitations in existing personalized reward models by transforming preference signals into structured evaluation chains to derive adaptive personas and scoring rubrics. P-GenRM clusters users into User Prototypes and employs a dual-granularity scaling mechanism, scaling at both the individual and prototype levels to mitigate noise and enhance generalization. Experiments demonstrate state-of-the-art results on personalized reward model benchmarks, with a 2.31% average improvement and a 3% boost from test-time user-based scaling, indicating stronger personalized alignment.

Introduces a personalized generative reward model (P-GenRM) that leverages structured evaluation chains and dual-granularity scaling to improve personalization and generalization in reward modeling for LLMs.

Pinyi Zhang, Ting-En Lin, Yuchuan Wu +5 | 2602.12116

RLHF & Preference Learning | Recommendation & Information Retrieval | Natural Language Processing

2d ago

Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning

The paper addresses the "Shallow Exploration Trap" in in-context learning, where autoregressive models struggle to generate long reasoning trajectories needed for broader state coverage. They propose Length-Incentivized Exploration (LIE), a reinforcement learning approach that rewards longer reasoning trajectories while penalizing redundancy. Experiments on Qwen3 and Llama models demonstrate that LIE improves in-context exploration, leading to performance gains of 4.4% on in-domain and 2.7% on out-of-domain tasks.

Introduces Length-Incentivized Exploration (LIE), a novel reinforcement learning method to encourage longer and more diverse reasoning trajectories in in-context learning by rewarding length and penalizing redundancy.

Yun Luo, Ganqu Cui, Zhi Wang +3 | 2602.11748

RLHF & Preference Learning | Reasoning & Chain-of-Thought | Tool Use & Agents

2d ago

TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

The paper introduces Trajectory-Search Rollouts (TSR), a training-time method that uses lightweight tree search to improve the quality of rollouts in multi-turn reinforcement learning for LLM agents. TSR selects high-scoring actions at each turn during rollout generation using task-specific feedback, leading to more informative training trajectories. Experiments on Sokoban, FrozenLake, and WebShop demonstrate that TSR, when combined with PPO and GRPO, achieves up to 15% performance gains and more stable learning.

Introduces a novel training-time trajectory generation method, TSR, that leverages lightweight tree search to construct higher-quality rollouts for multi-turn RL of LLM agents.

Aladin Djuhera, S. Kadhe, Holger Boche | 2602.11767

RLHF & Preference Learning | Tool Use & Agents | World Models & Planning

Feb 9, 2026

5d ago

Generative AI Chatbots as Digital Adjuncts for Sexual Health Information After Prostate Cancer in Men Who Have Sex With Men: Auto-Netnographic Study

This study used auto-netnography to analyze how four GenAI chatbots (ChatGPT, Claude, Copilot, and Gemini) respond to sexual health questions from a simulated gay male patient post-prostate cancer treatment. The analysis focused on interactional framing, emotional attunement, and specificity of the chatbots' responses, revealing variations in communication styles categorized into four quadrants: structured overview, rational clarity, compassionate perspective, and compassionate precision. The findings suggest that GenAI chatbots can offer supportive and culturally sensitive information in this context, complementing clinical practice by facilitating reflection and access to sensitive information, although they cannot replace professional care.

This paper characterizes the interactional styles of four prominent GenAI chatbots when addressing sexual health concerns of gay men post-prostate cancer treatment, revealing a spectrum of logical-to-empathetic orientations and general-to-specific framings.

Mats Christiansen, Lisbeth Fagerström

RLHF & Preference Learning

Feb 6, 2026

1w ago

PLF-Mamba: Analyzing Individual Milk Yield Dynamics Under Data Scarcity Using Selective State Space Models

This paper introduces PLF-Mamba, a framework combining reinforcement learning (RL)-based dynamic feature gating with the Mamba selective state space model to predict daily milk yield from noisy, short-sequence dairy farming datasets. The RL policy learns to mask uninformative sensor features, while Mamba captures long-range dependencies with linear complexity. Experiments on the MMCows dataset demonstrate PLF-Mamba achieves an average R² of 0.656 and exhibits lower head-wise variance compared to Transformer baselines, highlighting its robustness to individual cow heterogeneity and data scarcity.

Introduces a novel framework, PLF-Mamba, that integrates RL-based feature gating with the Mamba architecture to improve milk yield prediction in noisy, data-scarce environments.

Jonghyun Kim, Chaebong Sohn

Architecture Design (Transformers, SSMs, MoE) | RLHF & Preference Learning

Feb 5, 2026

1w ago

HiCrowd: Hierarchical Crowd Flow Alignment for Dense Human Environments

The paper introduces HiCrowd, a hierarchical framework combining reinforcement learning (RL) and model predictive control (MPC) to improve robot navigation in dense crowds. A high-level RL policy selects a "follow point" to align the robot with compatible crowd flows, while a low-level MPC tracks this point with short-horizon planning for safety. Experiments on real-world and synthetic datasets demonstrate that HiCrowd outperforms reactive and learning-based baselines in navigation efficiency, safety, and reducing freezing behaviors.

Introduces a hierarchical RL-MPC framework (HiCrowd) that leverages pedestrian motion as guidance for robot navigation in dense crowds, improving efficiency and safety compared to existing methods.

Yufei Zhu, Shih-Min Yang, Martin Magnusson +1 | 2602.05608

Robotics & Embodied AI | World Models & Planning | RLHF & Preference Learning

1w ago

Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation

This paper identifies an implicit advantage symmetry in Group Relative Advantage Estimation (GRAE), the reward processing component of GRPO, that hinders exploration and difficulty adaptation in Reinforcement Learning with Verifiable Rewards (RLVR). The authors demonstrate that this symmetry leads to unchanged unsampled action logits and a bias towards medium-difficulty samples. They then propose Asymmetric GRAE (A-GRAE) to dynamically modulate exploration incentives and sample-difficulty focus.

Introduces Asymmetric GRAE (A-GRAE) to address the implicit advantage symmetry in GRPO, improving exploration and difficulty adaptation.

Zhiqi Yu, Zhangquan Chen, Mengting Liu +2 | 2602.05548

RLHF & Preference Learning | Reasoning & Chain-of-Thought
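For readers unfamiliar with the group-relative advantage step that the entry above refers to as GRAE, the sketch below shows the commonly used GRPO normalization: each sampled response's reward is centered on the group mean and scaled by the group standard deviation. The use of the population standard deviation, the `eps` constant, and the function name are assumptions. The proposed A-GRAE modification is not shown, since the summary does not specify it.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Center rewards on the group mean and scale by the group std.

    `rewards` holds one scalar reward per response sampled for the same
    prompt; `eps` guards against division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the rewards are mean-centered, the advantages within a group always sum to zero, so positive and negative signals balance exactly. Whether this is precisely the symmetry the paper analyzes is not stated in the summary.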

Feb 4, 2026

1w ago

A Human-Centered Privacy Approach (HCP) to AI

This chapter proposes a human-centered privacy (HCP) framework for AI, addressing privacy risks across the AI development lifecycle from data collection to deployment. It integrates technical solutions like federated learning and differential privacy with user perspectives, ethical considerations, and regulatory landscapes. The framework provides design guidelines and case studies, advocating for a multidisciplinary approach to embed privacy into HCAI.

Introduces a human-centered privacy (HCP) framework that holistically integrates technical, ethical, and human factors perspectives to address privacy risks in human-centered AI systems.

Luyi Sun, Wei Xu, Zaifeng Gao | 2602.04616

RLHF & Preference Learning | Reasoning & Chain-of-Thought

Feb 2, 2026

1w ago

Human-centered AI to promote youth mental health: a serendipitous natural experiment enabled by a digital health platform

This study examines the impact of different AI-driven nudging strategies within a digital health platform on Indigenous youth compliance with mental health assessments. A natural experiment was created by system disruptions that altered the types of nudges delivered (system-triggered, non-personalized, personalized), allowing the researchers to measure the effect on assessment completion rates. The key finding is that personalized nudges, specifically "Best Picture" messages, significantly improved compliance, highlighting the importance of two-way communication in digital health interventions for this population.

Demonstrates the critical role of personalized, scientist-triggered nudges in maintaining engagement and compliance within a digital health platform designed for Indigenous youth mental health.

T. Katapally, Nadine Elsahli, Jasmin Bhawra

RLHF & Preference Learning | Reasoning & Chain-of-Thought

1w ago

ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning

The paper introduces ECHO-2, a distributed reinforcement learning framework designed to optimize the post-training of large language models by distributing rollout execution across remote inference workers. ECHO-2 addresses challenges related to wide-area coordination and policy dissemination latency by treating policy staleness as a user-controlled parameter and overlapping rollout generation, dissemination, and training. Experimental results on GRPO post-training of 4B and 8B models demonstrate that ECHO-2 achieves significant cost efficiency improvements while maintaining comparable RL reward performance.

Introduces ECHO-2, a distributed RL framework that optimizes cost efficiency in LLM post-training by overlapping rollout generation, dissemination, and training, and managing policy staleness.

Jie Xiao, Meng Chen, Qingnan Ren +13 | 2602.02192

RLHF & Preference Learning | Distributed Systems & Hardware | Training Efficiency & Optimization

1w ago

SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization

The paper introduces SLIME (Stabilized Likelihood Implicit Margin Enforcement), a novel reference-free alignment objective for preference optimization in LLMs that addresses the objective mismatch in existing methods like DPO. SLIME decouples preference learning from generation quality by incorporating an anchoring term to maximize the likelihood of preferred responses, a stabilizing penalty to prevent rejected token probabilities from collapsing, and a dual-margin mechanism for boundary shaping. Experiments demonstrate that SLIME outperforms state-of-the-art baselines while maintaining higher generation stability, mitigating issues like unlearning and formatting collapse.

Introduces a novel reference-free alignment objective, SLIME, that decouples preference learning from generation quality by stabilizing likelihoods and enforcing dual margins.

Maksim Afanasyev, Illarion Iov | 2602.02383

RLHF & Preference Learning | Training Efficiency & Optimization | Natural Language Processing

Jan 30, 2026

2w ago

Human-Centered Explainability in AI-Enhanced UI Security Interfaces: Designing Trustworthy Copilots for Cybersecurity Analysts

This paper investigates the impact of different explanation styles in AI-driven security dashboards on user trust, decision accuracy, and cognitive load. The authors conducted a mixed-methods study with security practitioners, comparing natural language rationales, confidence visualizations, counterfactual explanations, and hybrid approaches. Results demonstrate that explanation style significantly affects user trust calibration, decision accuracy, and cognitive load, leading to design guidelines for integrating explainability into enterprise UIs.

Empirically demonstrates the impact of various explanation styles on security analysts' trust, decision-making, and cognitive load within AI-enhanced UI security interfaces.

Mona Rajhans | 2601.22653

Interpretability & Mechanistic Interp | RLHF & Preference Learning

2w ago

Real-Time Aligned Reward Model beyond Semantics

This paper addresses reward overoptimization in Reinforcement Learning from Human Feedback (RLHF) by introducing Real-Time Aligned Reward Model (R2M). R2M enhances reward models by incorporating real-time feedback from the evolving hidden states of the policy model, going beyond reliance on surface semantic information. The approach mitigates reward discrepancy caused by policy distribution shifts during RL, leading to improved alignment between the reward model and policy model.

Introduces R2M, a novel RLHF framework that aligns the reward model with the real-time distribution shift of the policy by leveraging the evolving hidden states of the policy model.

2601.22664

RLHF & Preference Learning

Jan 29, 2026

2w ago

Factored Causal Representation Learning for Robust Reward Modeling in RLHF

This paper addresses the problem of spurious correlations in reward models used in Reinforcement Learning from Human Feedback (RLHF) by proposing a factored representation learning framework. The framework decomposes contextual embeddings into causal factors sufficient for reward prediction and non-causal factors capturing reward-irrelevant attributes, constraining the reward head to depend only on the causal component. Experiments on mathematical and dialogue tasks demonstrate improved robustness and downstream RLHF performance compared to baselines, with analyses showing mitigation of reward hacking behaviors like exploiting length and sycophantic bias.

Introduces a factored representation learning framework that decomposes contextual embeddings into causal and non-causal factors to improve the robustness of reward models in RLHF.

Yupei Yang, Lin Yang, Wanxi Deng +5 · 2601.21350

RLHF & Preference Learning · Interpretability & Mechanistic Interp
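The constraint described in the summary above, a reward head that depends only on the causal component, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the positional split, the linear head, and the names `split_factors`, `reward_head`, and `factored_reward` are hypothetical, and the paper learns its decomposition rather than slicing an embedding by index.

```python
from typing import List, Tuple


def split_factors(embedding: List[float], causal_dim: int) -> Tuple[List[float], List[float]]:
    """Split an embedding into (causal, non_causal) parts.

    Slicing by position is an illustrative assumption, not the paper's method.
    """
    return embedding[:causal_dim], embedding[causal_dim:]


def reward_head(causal: List[float], weights: List[float], bias: float = 0.0) -> float:
    """Linear reward head that only ever receives the causal factors."""
    return sum(w * z for w, z in zip(weights, causal)) + bias


def factored_reward(embedding: List[float], weights: List[float], causal_dim: int) -> float:
    """Score an embedding while ignoring its non-causal part entirely."""
    causal, _non_causal = split_factors(embedding, causal_dim)
    return reward_head(causal, weights)


# Two embeddings that differ only in a nuisance slot (standing in for length).
weights = [0.5, -1.0]
short_response = [2.0, 1.0, 0.1]
long_response = [2.0, 1.0, 9.9]
same_score = factored_reward(short_response, weights, 2) == factored_reward(long_response, weights, 2)
```

Because the head never sees the last slot, changing it cannot move the reward, which is the kind of insensitivity to length-like attributes the summary describes.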

Jan 23, 2026

Deconstructing Taste: Toward a Human-Centered AI Framework for Modeling Consumer Aesthetic Perceptions

This paper introduces a human-centered AI framework for modeling consumer aesthetic perceptions by integrating subjective evaluations with domain-specific and computer vision-based features. The framework jointly models human-derived (consumer and designer) and machine-extracted features to link model outcomes to interpretable design features. The authors demonstrate how perceptual features, design patterns, and consumer interpretations contribute to aesthetic evaluations, enabling better understanding and anticipation of consumer taste.

Introduces a novel human-centered computational framework that explicitly links subjective aesthetic evaluations to interpretable design features through the joint modeling of human-derived and machine-extracted features.

Matthew K. Hong, Joey Li, Alexandre L. S. Filipowicz +6 · 2601.17134

RLHF & Preference Learning · Multimodal Models

Jan 20, 2026

Weather-R1: Logically Consistent Reinforcement Fine-Tuning for Multimodal Reasoning in Meteorology

The authors introduce WeatherQA, a new multimodal reasoning benchmark for meteorology, and Logically Consistent Reinforcement Fine-Tuning (LoCo-RFT) to address the issue of self-contradictory reasoning in VLMs. LoCo-RFT incorporates a logical consistency reward to ensure the model's reasoning aligns with its final answer, crucial for high-stakes domains like meteorology. The resulting model, Weather-R1, achieves a 9.8 percentage point improvement over the baseline on WeatherQA, surpassing supervised fine-tuning, standard RFT, and even the original Qwen2.5-VL-32B.

Introduces Logically Consistent Reinforcement Fine-Tuning (LoCo-RFT) to mitigate self-contradictory reasoning in vision-language models by incorporating a logical consistency reward signal.

Kaiyu Wu, Pucheng Han, Hualong Zhang +2 · 2601.14044

Reasoning & Chain-of-Thought · Multimodal Models · RLHF & Preference Learning

Jan 19, 2026

Heart2Mind: Human-Centered Contestable Psychiatric Disorder Prediction System Using Wearable ECG Monitors

The paper introduces Heart2Mind, a Contestable AI (CAI) system for psychiatric disorder prediction using wearable ECG data, designed to allow clinicians to inspect and revise algorithmic outputs. The system employs a Multi-Scale Temporal-Frequency Transformer (MSTFT) to analyze R-R intervals from ECG sensors, combining time and frequency domain features. Results on the HRV-ACC dataset show MSTFT achieves 91.7% accuracy, and human-centered evaluation demonstrates that experts and the CAI system can effectively collaborate to confirm correct decisions and correct errors through dialogue.

Introduces a contestable AI system, Heart2Mind, that integrates a multi-scale temporal-frequency transformer with self-adversarial explanations and a collaborative chatbot to enable clinicians to scrutinize and refine psychiatric disorder predictions based on wearable ECG data.

Hung Nguyen, Alireza Rahimi, Veronica Whitford +4

Interpretability & Mechanistic Interp · RLHF & Preference Learning

Using large language model-based artificial intelligence (AI) suspects to train strategic use of evidence: Preliminary evidence of transfer to mock suspect interviews.

This study investigated whether training individuals on the strategic use of evidence (SUE) interview technique using large language model (LLM)-based AI suspects improves their ability to detect deception in subsequent interviews with human mock suspects. Participants were trained with either instruction alone or instruction combined with AI suspect simulations, and the results showed that both training groups used evidence-statement inconsistencies more effectively in their judgments compared to a control group. Furthermore, the group trained with AI suspects demonstrated better accuracy in judging the veracity of human mock suspects, suggesting a potential benefit of AI-enhanced training for SUE.

Demonstrates that training individuals on strategic use of evidence with LLM-based AI suspects can improve their ability to detect deception in human interviews, although the advantage over instruction-only training was limited.

Siyu Li, P. Granhag, Yunhan Shi +4

Reasoning & Chain-of-Thought · RLHF & Preference Learning

Jan 16, 2026

Integrated Human-Centered Artificial Intelligence (HCAI) Performance & Development Model: Bridging the Policy-to-Practice Divide in Performance Management and Employee Development

This paper addresses the gap between HCAI policy ideals and their practical application in performance management by proposing the Integrated HCAI Performance & Development Model. The model integrates AI-powered analytics with human-centered interpretation, continuous feedback loops, and a strategic HR policy foundation to create a more ethical and developmental performance management process. The key result is a four-component framework designed to align organizational policies with technology-enhanced practices.

Introduces a novel Integrated HCAI Performance & Development Model that bridges the gap between AI-driven analytics and human-centered management in performance evaluation and employee development.

Rosemary Uche Packson-Enajerho

RLHF & Preference Learning · Reasoning & Chain-of-Thought

Jan 14, 2026

Department of Cardiology · Jan 14, 2026

From Agents to Governance: Essential AI Skills for Clinicians in the Large Language Model Era

This paper proposes a 3-tier competency framework designed to equip clinicians with essential AI skills for the effective and responsible integration of large language models in clinical practice. The framework spans foundational skills for safe use, intermediate skills for evaluation, and advanced skills for ethical governance and model lifecycle management. The authors argue that integrating this framework into medical education and job descriptions will standardize AI deployment, promote safer clinical practice, and ultimately improve patient outcomes.

Introduces a tiered competency framework to guide clinicians in acquiring the necessary skills for responsible and effective use of AI in clinical settings.

Weiping Cao, Qing Zhang, Jialin Liu +1

RLHF & Preference Learning · Reasoning & Chain-of-Thought

Jan 10, 2026

Can a Unimodal Language Agent Provide Preferences to Tune a Multimodal Vision-Language Model?

This paper investigates whether a unimodal language model can provide effective feedback to tune a multimodal vision-language model (VLM). The authors propose a method in which a language agent gives the VLM preference feedback so that its text generation adapts to the agent's preferences. Experiments demonstrate that LLM preference feedback enhances VLM descriptions, leading to improvements of up to 13% in absolute accuracy and a 64.6% preference alignment rate with human judgments.

Demonstrates that a unimodal language model can effectively provide preference feedback to tune a multimodal vision-language model, improving its descriptive accuracy and alignment with human preferences.

Sazia Tabasum Mim, Jack Morris, Manish Dhakal +3 · 2601.06424

RLHF & Preference Learning · Reasoning & Chain-of-Thought · Multimodal Models

Temple University Hospital · Jan 10, 2026

Readability and quality of information of AI-generated patient education materials on familial adenomatous polyposis.

This study evaluated the readability and quality of patient education materials (PEMs) generated by five AI chatbots (ChatGPT, Microsoft Copilot, Google Gemini, Perplexity, and Claude AI) in response to questions about familial adenomatous polyposis (FAP). The PEMs exhibited above-average quality as measured by DISCERN and PEMAT scores, but demonstrated poor readability, with a mean reading grade level of 12.44, significantly exceeding the recommended level for patient education. These findings suggest that while AI chatbots can provide valuable information, adjustments are needed to improve the accessibility of AI-generated PEMs for patients with varying literacy levels.

Reveals that AI chatbots generate patient education materials on familial adenomatous polyposis with acceptable quality but poor readability, highlighting a need for improved accessibility.

Nasir Asif, Dara Grobman, Joshua Samudre +3

RLHF & Preference Learning · Natural Language Processing · Scientific Discovery & Drug Design · Eval Frameworks & Benchmarks

Jan 7, 2026

EvoMDT: a self-evolving multi-agent system for structured clinical decision-making in multi-cancer

The paper introduces EvoMDT, a self-evolving multi-agent system designed to improve structured clinical decision-making in multi-cancer multidisciplinary tumor boards (MDTs). EvoMDT uses a self-evolution loop to dynamically update prompts, consensus weights, and retrieval scope based on expert feedback and outcome signals, enhancing robustness and traceability. Evaluated on oncology QA benchmarks and real-world datasets, EvoMDT outperformed LLM baselines, achieving higher guideline concordance, semantic alignment with expert plans, and comparable decision quality to human MDTs with reduced response time.

Introduces a self-evolving multi-agent system, EvoMDT, that adaptively refines its decision-making process for cancer treatment recommendations based on expert feedback and outcome signals.

Qicai Liu, Zhichao Hu, Tao Huang +10

Reasoning & Chain-of-Thought · RLHF & Preference Learning

Jan 6, 2026

Reducing Hallucinations in LLMs via Factuality-Aware Preference Learning

The paper introduces Factuality-aware Direct Preference Optimization (F-DPO), an extension of DPO designed to mitigate hallucinations in LLMs by incorporating binary factuality labels into the preference learning process. F-DPO addresses the issue of preference alignment methods reinforcing hallucinations by applying a label-flipping transformation to correct misordered preference pairs and adding a factuality-aware margin to emphasize pairs with clear correctness differences. Experiments across seven open-weight LLMs (1B-14B) demonstrate that F-DPO significantly improves factuality and reduces hallucination rates compared to both base models and standard DPO, while also generalizing to out-of-distribution benchmarks like TruthfulQA.

Introduces F-DPO, a novel and efficient method for reducing hallucinations in LLMs by integrating binary factuality labels into the DPO framework through label-flipping and factuality-aware margins.

Sindhuja Chaduvula, Ahmed Y. Radwan, Azib Farooq +2 · 2601.03027

RLHF & Preference Learning · Constitutional AI & AI Ethics · Eval Frameworks & Benchmarks
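The two F-DPO ingredients named in the summary above, label flipping and a factuality-aware margin, can be sketched on top of the standard per-pair DPO loss. This is a minimal sketch under stated assumptions: the function name `factuality_aware_dpo_loss`, the constant `margin`, and the rule that the margin applies only when the two factuality labels differ are hypothetical choices, since the summary does not give the paper's exact formulation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def factuality_aware_dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    chosen_factual: bool,
    rejected_factual: bool,
    beta: float = 0.1,
    margin: float = 1.0,  # hypothetical constant margin
) -> float:
    """Per-pair DPO loss with label flipping and a factuality margin (illustrative)."""
    # Label flipping: if the preferred response is non-factual while the
    # dispreferred one is factual, swap the roles of the two responses.
    if rejected_factual and not chosen_factual:
        logp_chosen, logp_rejected = logp_rejected, logp_chosen
        ref_logp_chosen, ref_logp_rejected = ref_logp_rejected, ref_logp_chosen
        chosen_factual, rejected_factual = True, False

    # Standard DPO implicit-reward gap between the two responses.
    gap = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))

    # Assumed rule: demand a larger gap only when the labels disagree.
    m = margin if chosen_factual != rejected_factual else 0.0
    return -log_sigmoid(gap - m)
```

With all log-probabilities equal and both responses factual, the loss reduces to the usual DPO value of log 2; a misordered pair yields the same loss as the manually corrected pair.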

Jan 5, 2026

YuanLab.ai · Jan 5, 2026

Yuan3.0 Flash: An Open Multimodal Large Language Model for Enterprise Applications

The paper introduces Yuan3.0 Flash, an open-source 40B parameter MoE multimodal LLM with 3.7B activated parameters, optimized for enterprise applications. To mitigate overthinking in large reasoning models, the authors propose Reflection-aware Adaptive Policy Optimization (RAPO), a novel RL training algorithm. Yuan3.0 Flash achieves superior performance on enterprise tasks like RAG and table understanding, while maintaining competitive general-purpose reasoning with significantly fewer tokens compared to frontier models.

Introduces Reflection-aware Adaptive Policy Optimization (RAPO), a novel RL training algorithm to regulate overthinking behaviors in large reasoning models.

YuanLab.ai Shawn Wu, Sean Wang, Louie Li +22 · 2601.01718

Multimodal Models · Architecture Design (Transformers, SSMs, MoE) · RLHF & Preference Learning

Jan 4, 2026

College of Business · Jan 4, 2026

Navigating AI Transformation in Healthcare Call Centers: Balancing Efficiency, Ethics, and Human Expertise in HealthConnect’s Transition

This study examines HealthConnect's replacement of 90% of its human workforce with AI in healthcare call centers, assessing the balance between efficiency and ethical considerations. It employs a rapid literature review methodology using qualitative approaches to analyze the benefits and risks of AI adoption, focusing on workforce reduction, algorithmic bias, and patient trust. The key finding is that while AI increases efficiency in routine tasks, it also introduces risks of care prioritization disparities and transparency gaps, necessitating ethical frameworks and structured change management.

Demonstrates the necessity of ethical frameworks like human-centered AI and structured change management models to mitigate risks and ensure responsible AI implementation in healthcare call centers.

Julianne Hutchinson

RLHF & Preference Learning · Reasoning & Chain-of-Thought

Jan 3, 2026

From RLHF to Direct Alignment: A Theoretical Unification of Preference Learning for Large Language Models

This paper provides a theoretical unification of preference learning methods for aligning LLMs, demonstrating that methods like RLHF, DPO, IPO, KTO, and SimPO can be understood through three orthogonal axes: preference model, regularization mechanism, and data distribution. It formalizes these axes with definitions and theorems, revealing the coverage separation between online and offline methods, scaling laws for reward overoptimization, and failure conditions for direct alignment. The analysis identifies how specific design choices lead to failure modes like length hacking and mode collapse, and it synthesizes empirical findings into a practitioner's decision guide.

Establishes a unifying theoretical framework for preference learning methods by identifying and formalizing three key orthogonal axes: preference model, regularization mechanism, and data distribution.

Tarun Raheja, Nilay Pochhi · 2601.06108

RLHF & Preference Learning · Scalable Oversight & Alignment Theory
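For orientation, the three axes named in the summary above can be read off the widely used textbook forms of the RLHF objective and the DPO loss. The notation below is the standard one from the general literature and is an assumption here; the paper's own definitions and theorems may be stated differently.

```latex
% Preference model: Bradley-Terry over a reward r
p(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)

% Regularization mechanism: KL penalty to a reference policy, weighted by \beta.
% Data distribution: responses y are sampled online from \pi_\theta.
\max_{\pi_\theta} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \bigl[ r(x, y) \bigr]
  - \beta \, D_{\mathrm{KL}}\bigl( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr)

% DPO: same preference model and regularizer, but an offline dataset of pairs
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

In these forms the online RLHF objective and the DPO loss share the first two axes and differ on the third: the expectation is taken over policy samples in one case and over a fixed set of preference pairs in the other.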

Jan 2, 2026

Intercultural Human-Centered AI for Automotive Systems: Bridging ASPICE Processes and Intelligent Human Systems Integration

This paper introduces a framework for Intercultural Human-Centred AI that integrates Automotive SPICE (ASPICE) practices with intelligent human systems integration to address challenges in safety-critical automotive systems. The framework uses structured AI-driven assessments with explainable decision layers to improve consistency and auditability, incorporates design principles for intercultural user interface design, and positions intelligent assistant systems as partners to human assessors. Results from a prototype deployed to 12 domain experts processing 424 queries demonstrated high perceived usefulness and strong adoption intent, suggesting the framework's potential for enhancing human-AI collaboration in regulated industries.

Introduces a novel framework integrating ASPICE processes with human-centered AI to improve consistency, cultural inclusivity, and human-AI collaboration in safety-critical automotive systems.

Rüdiger Heimgärtner

RLHF & Preference Learning · Reasoning & Chain-of-Thought · Robotics & Embodied AI

2026

Mandaue City College · Jan 1, 2026

Human-Centered AI in Education: Educators’ Perspectives on Teacher Roles, Ethics, and Pedagogical Value

This qualitative study explores the perspectives of 25 educators in the Philippines on the integration of AI in education, focusing on its impact on teacher roles, ethics, and pedagogical value. The study identifies key themes including the perception of AI as an instructional support tool, the reaffirmation of irreplaceable human dimensions in teaching, systemic barriers to AI adoption, and ethical concerns. The findings emphasize the need for a teacher-centric integration framework that prioritizes infrastructure, professional development, and ethical safeguards.

Provides culturally grounded insights into educators' perspectives on AI in Philippine education, highlighting the importance of human-centered AI integration.

Jiomarie B. Jesus, Rizza R. Caumeran

RLHF & Preference Learning

Jan 1, 2026

Human and AI collaboration failures and model performance gaps in cardiac surgery: a blinded two-phase evaluation of five large language models

This paper evaluates the clinical performance of five large language models (LLMs) in complex cardiac surgery scenarios using a blinded two-phase evaluation by senior surgeons. The study found that while a reasoning-optimized proprietary LLM (O1) performed best, all models exhibited deficits in patient safety, hallucination avoidance, and clinical efficiency. A key finding was the "overacceptance" failure mode, where clinicians initially failed to identify flawed model outputs, suggesting that over-reliance on LLMs could pose significant risks in clinical decision-making.

Reveals a critical human-AI collaboration failure mode of "overacceptance" in cardiac surgery, where clinicians initially miss flawed LLM outputs, highlighting potential risks beyond simple model inaccuracy.

M. Leon, R. B. Feng, M. Q. Flores +5

Reasoning & Chain-of-Thought · Interpretability & Mechanistic Interp · RLHF & Preference Learning

Dec 30, 2025

GRADE: Replacing Policy Gradients with Backpropagation for LLM Alignment

The paper introduces GRADE, a novel method for aligning LLMs with human preferences that replaces policy gradient methods with direct backpropagation. GRADE utilizes the Gumbel-Softmax reparameterization with straight-through estimation (GRADE-STE) to enable end-to-end gradient flow from reward signals through generated tokens to model parameters. Experiments on sentiment-controlled text generation using the IMDB dataset demonstrate that GRADE-STE achieves a 50% relative improvement over PPO, exhibits significantly lower gradient variance, and maintains stable training dynamics.

Introduces GRADE, a method that replaces high-variance policy gradient estimation with direct backpropagation through a differentiable relaxation of the discrete token sampling process for LLM alignment.

Lukas Abrie Nel · 2601.11574

RLHF & Preference Learning · Training Efficiency & Optimization · Architecture Design (Transformers, SSMs, MoE)
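The Gumbel-Softmax straight-through idea mentioned in the GRADE summary above can be shown for a single token position in plain Python. This is a toy sketch, not the paper's implementation: the reward is assumed to be linear in the emitted token (`token_rewards` is a hypothetical per-token score), which lets the backward pass be written out by hand instead of relying on an autodiff library.

```python
import math
import random
from typing import List, Tuple


def gumbel_softmax_sample(logits: List[float], tau: float, rng: random.Random) -> Tuple[List[float], List[float]]:
    """Return (soft, hard): a relaxed sample and the one-hot of its argmax."""
    # Gumbel(0, 1) noise via inverse transform; clamp u away from 0 to keep log defined.
    noisy = [l - math.log(-math.log(max(rng.random(), 1e-12))) for l in logits]
    scaled = [n / tau for n in noisy]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    soft = [e / total for e in exps]
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return soft, hard


def straight_through_grad(soft: List[float], token_rewards: List[float], tau: float) -> List[float]:
    """Gradient of the reward with respect to the logits, straight-through style.

    The forward pass would use the hard one-hot token; the backward pass treats
    the output as `soft`. For R = sum_i out_i * token_rewards[i] this gives
    dR/dlogit_j = soft_j * (token_rewards[j] - sum_i soft_i * token_rewards[i]) / tau.
    """
    baseline = sum(s * r for s, r in zip(soft, token_rewards))
    return [s * (r - baseline) / tau for s, r in zip(soft, token_rewards)]


rng = random.Random(0)
soft, hard = gumbel_softmax_sample([0.0, 1.0, 2.0], tau=0.5, rng=rng)
grad = straight_through_grad(soft, token_rewards=[0.0, 0.0, 1.0], tau=0.5)
```

The gradient entries sum to zero and push probability toward tokens whose reward exceeds the soft-weighted average, so the reward signal reaches the logits without a sampled policy-gradient estimate.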

Dec 29, 2025

Close the Loop: Synthesizing Infinite Tool-Use Data via Multi-Agent Role-Playing

The paper introduces InfTool, a multi-agent framework comprising a User Simulator, Tool-Calling Assistant, and MCP Server, designed to autonomously generate tool-use trajectories from raw API specifications. InfTool closes the loop by training a model using Group Relative Policy Optimization (GRPO) with gated rewards on the synthesized data, iteratively improving the model's ability to generate higher-quality training data. Experiments on the Berkeley Function-Calling Leaderboard (BFCL) show that InfTool significantly improves a 32B model's accuracy from 19.8% to 70.9%, surpassing larger models and rivaling Claude-Opus, using only synthetic data.

Introduces a fully autonomous, self-evolving multi-agent framework, InfTool, for synthesizing diverse and verified tool-use trajectories, eliminating the need for human annotation and enabling significant performance gains in tool-calling accuracy.

Yuwen Li, Wei Zhang, Ze-Jun Huang +8 · 2512.23611

Tool Use & Agents · Data Curation & Synthetic Data · RLHF & Preference Learning
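The group-relative part of GRPO referenced in the InfTool summary above can be sketched as a within-group normalization of rewards. The gate shown here is a hypothetical stand-in: the summary says only that rewards are "gated", so `gated_reward` and its format check are assumptions, as is the use of the population standard deviation.

```python
import statistics
from typing import List


def gated_reward(format_ok: bool, task_score: float) -> float:
    """Hypothetical gate: the task score counts only if the tool call is well formed."""
    return task_score if format_ok else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards across one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled trajectories for one prompt: (format_ok, task_score).
samples = [(True, 1.0), (False, 0.9), (True, 0.0), (True, 1.0)]
rewards = [gated_reward(ok, score) for ok, score in samples]
advantages = group_relative_advantages(rewards)
```

Each trajectory is scored against its own group rather than against a learned value function, and the malformed call receives a negative advantage despite its high raw task score.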
