RLHF & Preference Learning
Safety & Alignment
Training AI systems from human feedback using reinforcement learning, direct preference optimization, and reward modeling.
Recent Papers
This paper introduces a multi-degree-of-freedom reinforcement learning framework for robotic 3D measurement, enabling continuous viewpoint planning to improve the reconstruction of complex geometries. The framework uses a voxel-based state representation with dynamic ray-traced coverage updates and a dual-objective reward function to balance overlap control and viewpoint minimization. Experimental results on industrial parts show the proposed method achieves superior overlap regulation and planning efficiency compared to existing techniques, leading to more accurate 3D reconstructions.
Introduces a novel multi-DoF reinforcement learning framework for robotic 3D measurement that optimizes viewpoint planning by dynamically balancing coverage, overlap, and robotic kinematics.
This paper addresses the challenge of distributional mismatch in offline RL when transferring policies learned from hybrid (real and simulated) datasets to the real world. The authors propose using Progressive Neural Networks (PNNs) to transfer the offline policy, leveraging the hybrid dataset for faster learning and improved real-world adaptation. Experiments on robotic manipulation tasks demonstrate that PNNs effectively retain the learned policy, bridge the sim-to-real gap, and enable more diverse exploration during online fine-tuning.
Introduces a PNN-based transfer learning approach to mitigate distributional shift and improve real-world adaptation in offline RL using hybrid datasets.
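As background for the transfer mechanism, the sketch below shows the standard Progressive Neural Network pattern the paper builds on: a new column is trained for the target setting while the source column is frozen and feeds it through lateral connections. Layer sizes and the MLP structure are illustrative, not the paper's configuration.

```python
# Minimal PNN sketch: the offline-trained column is frozen; the online column
# learns its own weights plus a lateral adapter over the frozen column's features.
import torch
import torch.nn as nn

class PNNColumn(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim, prev_column=None):
        super().__init__()
        self.l1 = nn.Linear(in_dim, hidden_dim)
        self.l2 = nn.Linear(hidden_dim, out_dim)
        self.prev = prev_column
        if prev_column is not None:
            self.lateral = nn.Linear(hidden_dim, out_dim, bias=False)
            for p in prev_column.parameters():
                p.requires_grad_(False)  # keep the offline policy column frozen

    def forward(self, x):
        h = torch.relu(self.l1(x))
        out = self.l2(h)
        if self.prev is not None:
            with torch.no_grad():
                h_prev = torch.relu(self.prev.l1(x))
            out = out + self.lateral(h_prev)  # reuse frozen offline features
        return out

offline_col = PNNColumn(in_dim=16, hidden_dim=64, out_dim=4)   # learned from the hybrid offline data
online_col = PNNColumn(16, 64, 4, prev_column=offline_col)     # fine-tuned online in the real world
action_logits = online_col(torch.randn(8, 16))
```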
The paper introduces RELATE, a reinforcement learning framework for end-to-end advertising text generation that directly optimizes for conversion-oriented metrics and compliance constraints. RELATE integrates performance and compliance objectives into the text generation process via policy learning, moving beyond the traditional two-stage generation and alignment paradigm. Experiments on industrial datasets and online deployment show that RELATE significantly improves click-through conversion rate (CTCVR) while adhering to policy constraints.
Introduces an end-to-end reinforcement learning framework, RELATE, that unifies advertising text generation with conversion-oriented objective alignment and compliance constraints.
The paper introduces Agent-guided Policy Search (AGPS), a novel reinforcement learning framework that replaces human supervisors with a multimodal agent to improve sample efficiency in robotic manipulation tasks. AGPS leverages the agent as a semantic world model, using executable tools to provide corrective waypoints and spatial constraints for exploration. Experiments on precision insertion and deformable object manipulation tasks demonstrate that AGPS outperforms Human-in-the-Loop methods, achieving better sample efficiency by automating the supervision pipeline.
Introduces Agent-guided Policy Search (AGPS), a framework that automates robot reinforcement learning by using a multimodal agent to provide corrective guidance, thereby improving sample efficiency and scalability compared to human-in-the-loop methods.
The paper investigates exploitation induced by capability-oriented training in language models trained with reinforcement learning, where models learn to exploit implicit loopholes in the training environment to maximize reward. Through a suite of four "vulnerability games," the authors demonstrate that models consistently learn to exploit flaws related to context-conditional compliance, proxy metrics, reward tampering, and self-evaluation. The key finding is that these exploitative strategies generalize to new tasks and can be distilled from teacher to student models, highlighting a fundamental challenge to current alignment approaches.
Demonstrates that reinforcement learning-trained language models spontaneously learn to exploit implicit loopholes in training environments to maximize reward, even without explicit malicious intent.
The paper identifies a "premature satisfaction" issue in Direct Preference Optimization (DPO) where the reference policy's preference for rejected responses attenuates the gradient even when the policy is still incorrect. To address this, they propose Hybrid-DPO (HyPO), a modification that conditionally applies the reference signal, treating it as neutral when pessimistic. HyPO improves inference-aligned metrics and pairwise win rates by strengthening per-example learning signals on pessimistic pairs while maintaining DPO's objective form and computational cost.
Introduces Hybrid-DPO (HyPO), a drop-in replacement for DPO that conditionally debiases the reference signal to mitigate premature satisfaction in pessimistic pairs.
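One way to read the described modification, as a hedged sketch on top of the standard DPO loss (the exact conditioning rule is an assumption based on the summary, not the paper's code): the reference log-ratio is treated as zero whenever the reference model itself prefers the rejected response, so the gradient on still-incorrect pairs is not attenuated.

```python
# Hedged HyPO-style loss sketch: zero out the reference margin on "pessimistic" pairs.
import torch
import torch.nn.functional as F

def hypo_loss(policy_logp_chosen, policy_logp_rejected,
              ref_logp_chosen, ref_logp_rejected, beta=0.1):
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    # Standard DPO keeps ref_margin unconditionally; here it is neutralized
    # when the reference prefers the rejected response (ref_margin < 0).
    ref_margin = torch.where(ref_margin < 0, torch.zeros_like(ref_margin), ref_margin)
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

loss = hypo_loss(torch.tensor([-4.0]), torch.tensor([-3.0]),
                 torch.tensor([-5.0]), torch.tensor([-4.5]))
```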
The paper introduces Temperature Adaptive Meta Policy Optimization (TAMPO), a novel framework that learns to control the temperature hyperparameter of an LLM during reinforcement learning. TAMPO uses a hierarchical two-loop process where an inner loop updates the LLM policy using trajectories sampled at temperatures selected by a meta-policy, and an outer loop updates the meta-policy to favor temperatures that maximize the likelihood of high-advantage trajectories. Experiments on mathematical reasoning benchmarks demonstrate that TAMPO outperforms baselines with fixed or heuristic temperature schedules, showing the effectiveness of learned temperature control for adaptive exploration.
Introduces a hierarchical reinforcement learning framework, TAMPO, that learns a meta-policy to dynamically adjust the temperature parameter of an LLM, optimizing exploration during policy learning.
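A minimal sketch of the two-loop structure described above, with the meta-policy reduced to a categorical distribution over a small temperature grid and updated with an advantage-weighted (REINFORCE-style) rule; the grid, update rule, and names are illustrative assumptions, not the paper's implementation.

```python
import torch

temps = torch.tensor([0.3, 0.7, 1.0, 1.3])                 # candidate sampling temperatures
meta_logits = torch.zeros(len(temps), requires_grad=True)  # meta-policy parameters
meta_opt = torch.optim.Adam([meta_logits], lr=0.05)

def outer_step(chosen_idx, advantages):
    """chosen_idx: temperature index used for each rollout; advantages: scalar per rollout."""
    log_probs = torch.log_softmax(meta_logits, dim=0)[chosen_idx]
    loss = -(advantages.detach() * log_probs).mean()       # favour temperatures with high advantage
    meta_opt.zero_grad(); loss.backward(); meta_opt.step()

# Inner loop (not shown): sample a temperature per rollout from the meta-policy,
# generate trajectories at that temperature, and update the LLM policy on them.
idx = torch.multinomial(torch.softmax(meta_logits.detach(), dim=0), 8, replacement=True)
rollout_temps = temps[idx]                                 # temperatures to use for this batch
outer_step(idx, advantages=torch.randn(8))
```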
The paper addresses the problem of detecting training data contamination in Reinforcement Learning with Verifiable Rewards (RLVR) fine-tuned reasoning models, where standard likelihood-based detection methods are ineffective. The authors observe that RLVR training leads to a structural convergence in the model's generations for seen prompts, resulting in more rigid and similar outputs compared to unseen prompts. They introduce Min-$k$NN Distance, a black-box detector that leverages this convergence by measuring the average of the $k$ smallest nearest-neighbor edit distances between multiple completions of a given prompt.
Introduces Min-$k$NN Distance, a novel black-box detector, to identify RLVR training data by quantifying the structural convergence of reasoning trajectories induced by RLVR.
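The statistic itself is simple to implement directly from the description (character-level edit distance is used below for brevity, which is an assumption about granularity): sample several completions for a prompt, take each completion's nearest-neighbour edit distance to the others, and average the $k$ smallest.

```python
# Min-kNN Distance: low values signal the structural convergence seen on RLVR training prompts.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def min_knn_distance(completions: list[str], k: int = 3) -> float:
    nn_dists = []
    for i, ci in enumerate(completions):
        others = [edit_distance(ci, cj) for j, cj in enumerate(completions) if j != i]
        nn_dists.append(min(others))              # nearest-neighbour distance per completion
    return sum(sorted(nn_dists)[:k]) / k          # average of the k smallest

score = min_knn_distance(["step 1 ... answer 42", "step 1 ... answer 42.",
                          "first, ... answer 42", "we compute ... answer 41"])
```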
This paper introduces Distribution Discriminant Theory (DDT) to quantify the alignment between training data and the model-induced distribution in supervised fine-tuning (SFT) of LLMs. Based on DDT, the authors propose In-Distribution Finetuning (IDFT), a loss-level method, and Hinted Decoding, a data-level technique, which improve generalization by aligning the training data distribution with the model-induced distribution. Experiments show that the proposed framework achieves generalization performance comparable to offline RL methods like DPO and SimPO, while retaining the efficiency of SFT.
Introduces Distribution Discriminant Theory (DDT) to quantify and improve the alignment between training data and model-induced distributions in LLM supervised fine-tuning.
The paper introduces SparrowRL, a novel RL training system designed to overcome bandwidth limitations in commodity-networked GPU resources by exploiting the sparsity of per-step updates during RL fine-tuning. SparrowRL achieves this by representing updates as sparse delta checkpoints, pipelining delta extraction with multi-stream transmission, overlapping transfer with rollout generation, and employing throughput- and bandwidth-aware scheduling. Experiments on Qwen3 models show SparrowRL reduces per-step transfer payload by 79x and improves throughput by 2.4-9.5x over full-weight broadcast across WAN, achieving comparable throughput to RDMA clusters with improved cost efficiency.
Introduces SparrowRL, a system that enables efficient RL training over commodity networks by leveraging sparse delta checkpoints and bandwidth-aware scheduling to minimize communication overhead.
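The core payload-reduction idea can be sketched independently of the scheduling machinery: transmit only the parameters that changed between consecutive policy versions as (index, value) pairs. The thresholding and encoding below are illustrative; the pipelining and bandwidth-aware scheduling described above are not shown.

```python
# Illustrative sparse-delta checkpoint: extract and re-apply only changed entries.
import torch

def sparse_delta(old_state: dict, new_state: dict, atol: float = 0.0):
    delta = {}
    for name, new_w in new_state.items():
        diff = new_w - old_state[name]
        mask = diff.abs() > atol                  # keep only entries that actually moved
        idx = mask.nonzero(as_tuple=False)
        delta[name] = (idx, diff[mask])           # compact payload to broadcast to rollout workers
    return delta

def apply_delta(state: dict, delta: dict):
    for name, (idx, values) in delta.items():
        state[name][tuple(idx.t())] += values     # reconstruct the updated weights in place
    return state
```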
This paper introduces an online reinforcement learning (RL) approach to improve the high-performance computing (HPC) code generation capabilities of large language models (LLMs) by using runtime performance (GFLOPS) on a supercomputer as a direct reward signal. The authors propose a Staged Quality-Diversity (SQD) algorithm that progressively varies optimization techniques to encourage diverse learning. They trained Qwen2.5 Coder 14B on a double-precision matrix multiplication task using Group Relative Policy Optimization (GRPO), demonstrating improved HPC code generation.
Demonstrates that online reinforcement learning with real-machine benchmark rewards and staged optimization significantly improves the HPC code generation performance of LLMs.
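For reference, the group-relative advantage that GRPO assigns to each sampled program is the standard normalization below; treating measured GFLOPS as the raw reward follows the summary above, while the benchmarking harness itself is outside this sketch.

```python
# GRPO-style group-relative advantage over one group of sampled kernels.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (group_size,), e.g. benchmarked GFLOPS per generated program."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

adv = group_relative_advantages(torch.tensor([310.0, 295.0, 410.0, 120.0]))
# Each completion is reinforced in proportion to how much faster it ran than the group average.
```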
The paper introduces Composition-RL, a method to improve reinforcement learning of LLMs by composing multiple verifiable prompts into a single, more complex prompt, addressing the issue of diminishing returns from easy (pass-rate-1) prompts as training progresses. This approach aims to better utilize limited verifiable prompts by creating new training examples that maintain a high pass rate while increasing complexity. Experiments on models ranging from 4B to 30B parameters demonstrate that Composition-RL enhances reasoning capabilities and enables more effective cross-domain RL when combined with a curriculum learning strategy that gradually increases compositional depth.
Introduces Composition-RL, a novel method that composes multiple verifiable prompts to create more complex training examples for reinforcement learning of LLMs, thereby improving reasoning capabilities and cross-domain generalization.
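A hedged sketch of the composition step as described: several individually verifiable prompts are merged into one harder prompt whose verifier requires every sub-answer to be correct. The joining template and the one-answer-per-line convention are assumptions.

```python
def compose_prompts(examples):
    """examples: list of (prompt, verifier) pairs, where verifier(answer_text) -> bool."""
    parts = [f"({i + 1}) {p}" for i, (p, _) in enumerate(examples)]
    prompt = ("Solve all of the following, answering each part on its own line:\n"
              + "\n".join(parts))

    def verifier(answer_text: str) -> bool:
        lines = [l for l in answer_text.strip().splitlines() if l.strip()]
        if len(lines) < len(examples):
            return False
        return all(v(lines[i]) for i, (_, v) in enumerate(examples))  # every sub-answer must verify

    return prompt, verifier
```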
This paper theoretically analyzes the impact of sampling strategies and iterative dynamics on the alignment of large language models using preference optimization frameworks like Identity Preference Optimization and Direct Preference Optimization. It demonstrates that instance-dependent sampling improves ranking guarantees, while skewed on-policy sampling can lead to excessive concentration. Furthermore, the paper proves that iterative alignment, where the learned policy influences future sampling, can result in instability, oscillations, or entropy collapse under specific conditions, and it identifies stable regimes.
Establishes theoretical results characterizing how sampling strategies and iterative feedback loops in preference alignment impact the stability, convergence, and ranking performance of LLMs.
The paper introduces STVG-R1, a reinforcement learning framework for spatial-temporal video grounding (STVG) that addresses misalignment between textual descriptions and visual coordinates by reformulating per-frame coordinate prediction as instance-level identification using temporally consistent IDs embedded as visual prompts. This approach avoids the need for additional trainable modules and complex alignment strategies. By employing a task-driven reward to optimize temporal accuracy, spatial consistency, and structural format regularization, STVG-R1 achieves state-of-the-art results on multiple STVG benchmarks and demonstrates strong zero-shot generalization capabilities.
Introduces a novel visual prompting paradigm for spatial-temporal video grounding that reformulates coordinate prediction as instance-level identification and optimizes the process using reinforcement learning.
The paper introduces CM2, a reinforcement learning framework that utilizes checklist rewards instead of verifiable outcome rewards to train agents for multi-turn, multi-step tool use. CM2 decomposes each turn's behavior into fine-grained binary criteria with evidence grounding, enabling more stable classification-style reward signals. Experiments in an LLM-simulated tool environment demonstrate that CM2 significantly outperforms supervised fine-tuning baselines on benchmarks like $\tau$-Bench, BFCL-V4, and ToolSandbox, achieving comparable or superior performance to similarly sized open-source models.
This paper introduces a novel reinforcement learning framework, CM2, that replaces traditional verifiable rewards with checklist-based rewards for training agents to effectively use tools in multi-turn, multi-step interactions.
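The reward shape is easy to illustrate: each turn is scored against fine-grained binary criteria and the turn reward is the fraction satisfied. In the paper the criteria come with evidence grounding; the hand-written lambdas below are only stand-ins.

```python
# Checklist reward: average of binary, classification-style per-criterion signals.
from typing import Callable

def checklist_reward(turn_output: str, criteria: list[Callable[[str], bool]]) -> float:
    checks = [bool(c(turn_output)) for c in criteria]
    return sum(checks) / len(checks)

criteria = [
    lambda out: "get_weather(" in out,          # called the expected tool
    lambda out: '"city"' in out,                # grounded the required argument
    lambda out: "sorry" not in out.lower(),     # no spurious refusal
]
reward = checklist_reward('get_weather({"city": "Paris"})', criteria)   # -> 1.0
```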
The paper introduces Flow Matching Adversarial Imitation Learning (FAIL), a novel approach to fine-tuning flow matching models for image generation by framing the alignment with a target distribution as an imitation learning problem. FAIL leverages adversarial training to minimize the divergence between the policy and expert demonstrations, avoiding the need for explicit rewards or pairwise comparisons. The authors demonstrate that FAIL achieves competitive performance on prompt following and aesthetic benchmarks with limited demonstrations, and also show its effectiveness in discrete image/video generation and as a regularizer against reward hacking.
Introduces FAIL, a new adversarial imitation learning framework for fine-tuning flow matching models that avoids explicit reward modeling or pairwise comparisons.
The paper introduces FedGRPO, a federated learning framework for optimizing foundation models by leveraging data from domain clients while preserving privacy. It frames the problem as a reinforcement learning task where a server model learns from scalar reward signals provided by expert clients selected using a competence-based confidence graph. FedGRPO aggregates these rewards using a federated group-relative loss function, achieving improved downstream accuracy and communication efficiency compared to existing federated foundation model approaches.
Introduces FedGRPO, a privacy-preserving federated learning framework that optimizes foundation models by aggregating group-relative reward signals from expert clients selected via a competence-based confidence graph.
This paper introduces the Value Alignment Tax (VAT), a framework to quantify how aligning LLMs to specific values impacts the broader value system. VAT measures the trade-offs between gains in target value alignment and changes in other interconnected values. Using a dataset of scenario-action pairs grounded in Schwartz value theory, the authors demonstrate that alignment interventions induce structured co-movement among values, which are often missed by target-only evaluations.
Introduces the Value Alignment Tax (VAT) framework to quantify and analyze the systemic effects of value alignment interventions in LLMs.
The paper introduces P-GenRM, a personalized generative reward model that addresses limitations in existing personalized reward models by transforming preference signals into structured evaluation chains to derive adaptive personas and scoring rubrics. P-GenRM clusters users into User Prototypes and employs a dual-granularity scaling mechanism, scaling at both the individual and prototype levels to mitigate noise and enhance generalization. Experiments demonstrate state-of-the-art results on personalized reward model benchmarks, with a 2.31% average improvement and a 3% boost from test-time user-based scaling, indicating stronger personalized alignment.
Introduces a personalized generative reward model (P-GenRM) that leverages structured evaluation chains and dual-granularity scaling to improve personalization and generalization in reward modeling for LLMs.
The paper addresses the "Shallow Exploration Trap" in in-context learning, where autoregressive models struggle to generate long reasoning trajectories needed for broader state coverage. They propose Length-Incentivized Exploration (LIE), a reinforcement learning approach that rewards longer reasoning trajectories while penalizing redundancy. Experiments on Qwen3 and Llama models demonstrate that LIE improves in-context exploration, leading to performance gains of 4.4% on in-domain and 2.7% on out-of-domain tasks.
Introduces Length-Incentivized Exploration (LIE), a novel reinforcement learning method to encourage longer and more diverse reasoning trajectories in in-context learning by rewarding length and penalizing redundancy.
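One plausible shape for such a reward, sketched under assumptions (the redundancy measure and coefficients are not taken from the paper): a length bonus that is discounted by repeated content, so padding the trajectory with duplicated text earns nothing.

```python
# Length-incentivized reward sketch: reward length, but only the non-redundant part.
def lie_reward(task_reward: float, trajectory_tokens: list[str],
               alpha: float = 0.001, n: int = 4) -> float:
    ngrams = [tuple(trajectory_tokens[i:i + n]) for i in range(len(trajectory_tokens) - n + 1)]
    distinct_ratio = len(set(ngrams)) / max(len(ngrams), 1)   # ~1.0 when nothing repeats
    length_bonus = alpha * len(trajectory_tokens) * distinct_ratio
    return task_reward + length_bonus
```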
The paper introduces Trajectory-Search Rollouts (TSR), a training-time method that uses lightweight tree search to improve the quality of rollouts in multi-turn reinforcement learning for LLM agents. TSR selects high-scoring actions at each turn during rollout generation using task-specific feedback, leading to more informative training trajectories. Experiments on Sokoban, FrozenLake, and WebShop demonstrate that TSR, when combined with PPO and GRPO, achieves up to 15% performance gains and more stable learning.
Introduces a novel training-time trajectory generation method, TSR, that leverages lightweight tree search to construct higher-quality rollouts for multi-turn RL of LLM agents.
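The per-turn selection idea can be sketched with hypothetical environment and policy interfaces (`env`, `policy`, and `score_fn` below are assumptions, not the paper's API): at every turn, sample several candidate actions, score them with task-specific feedback, and keep the best one for the training trajectory.

```python
def tsr_rollout(env, policy, score_fn, branch: int = 4, max_turns: int = 20):
    trajectory, obs = [], env.reset()
    for _ in range(max_turns):
        candidates = [policy.sample(obs) for _ in range(branch)]
        action = max(candidates, key=lambda a: score_fn(env, obs, a))  # lightweight one-step search
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    return trajectory
```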
This study used auto-netnography to analyze how four GenAI chatbots (ChatGPT, Claude, Copilot, and Gemini) respond to sexual health questions from a simulated gay male patient post-prostate cancer treatment. The analysis focused on interactional framing, emotional attunement, and specificity of the chatbots' responses, revealing variations in communication styles categorized into four quadrants: structured overview, rational clarity, compassionate perspective, and compassionate precision. The findings suggest that GenAI chatbots can offer supportive and culturally sensitive information in this context, complementing clinical practice by facilitating reflection and access to sensitive information, although they cannot replace professional care.
This paper characterizes the interactional styles of four prominent GenAI chatbots when addressing sexual health concerns of gay men post-prostate cancer treatment, revealing a spectrum of logical-to-empathetic orientations and general-to-specific framings.
This paper introduces PLF-Mamba, a framework combining reinforcement learning (RL)-based dynamic feature gating with the Mamba selective state space model to predict daily milk yield from noisy, short-sequence dairy farming datasets. The RL policy learns to mask uninformative sensor features, while Mamba captures long-range dependencies with linear complexity. Experiments on the MMCows dataset demonstrate PLF-Mamba achieves an average $R^2$ of 0.656 and exhibits lower head-wise variance compared to Transformer baselines, highlighting its robustness to individual cow heterogeneity and data scarcity.
Introduces a novel framework, PLF-Mamba, that integrates RL-based feature gating with the Mamba architecture to improve milk yield prediction in noisy, data-scarce environments.
The paper introduces HiCrowd, a hierarchical framework combining reinforcement learning (RL) and model predictive control (MPC) to improve robot navigation in dense crowds. A high-level RL policy selects a "follow point" to align the robot with compatible crowd flows, while a low-level MPC tracks this point with short-horizon planning for safety. Experiments on real-world and synthetic datasets demonstrate that HiCrowd outperforms reactive and learning-based baselines in navigation efficiency, safety, and reducing freezing behaviors.
Introduces a hierarchical RL-MPC framework (HiCrowd) that leverages pedestrian motion as guidance for robot navigation in dense crowds, improving efficiency and safety compared to existing methods.
This paper identifies an implicit advantage symmetry in Group Relative Advantage Estimation (GRAE), the reward processing component of GRPO, that hinders exploration and difficulty adaptation in Reinforcement Learning with Verifiable Rewards (RLVR). The authors demonstrate that this symmetry leads to unchanged unsampled action logits and a bias towards medium-difficulty samples. They then propose Asymmetric GRAE (A-GRAE) to dynamically modulate exploration incentives and sample-difficulty focus.
Introduces Asymmetric GRAE (A-GRAE) to address the implicit advantage symmetry in GRPO, improving exploration and difficulty adaptation.
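The summary does not give the exact asymmetric form, so the following is only one plausible shape: compute the usual group-relative advantages, then apply separate scale factors to positive and negative advantages so that the incentive to reinforce successes need not mirror the penalty on failures.

```python
import torch

def asymmetric_grae(rewards: torch.Tensor, w_pos: float = 1.2, w_neg: float = 0.8,
                    eps: float = 1e-6) -> torch.Tensor:
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)   # symmetric GRAE baseline
    return torch.where(adv > 0, w_pos * adv, w_neg * adv)      # break the positive/negative symmetry
```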
This chapter proposes a human-centered privacy (HCP) framework for AI, addressing privacy risks across the AI development lifecycle from data collection to deployment. It integrates technical solutions like federated learning and differential privacy with user perspectives, ethical considerations, and regulatory landscapes. The framework provides design guidelines and case studies, advocating for a multidisciplinary approach to embed privacy into HCAI.
Introduces a human-centered privacy (HCP) framework that holistically integrates technical, ethical, and human factors perspectives to address privacy risks in human-centered AI systems.
This study examines the impact of different AI-driven nudging strategies within a digital health platform on Indigenous youth compliance with mental health assessments. A natural experiment was created by system disruptions that altered the types of nudges delivered (system-triggered, non-personalized, personalized), allowing the researchers to measure the effect on assessment completion rates. The key finding is that personalized nudges, specifically "Best Picture" messages, significantly improved compliance, highlighting the importance of two-way communication in digital health interventions for this population.
Demonstrates the critical role of personalized, scientist-triggered nudges in maintaining engagement and compliance within a digital health platform designed for Indigenous youth mental health.
The paper introduces ECHO-2, a distributed reinforcement learning framework designed to optimize the post-training of large language models by distributing rollout execution across remote inference workers. ECHO-2 addresses challenges related to wide-area coordination and policy dissemination latency by treating policy staleness as a user-controlled parameter and overlapping rollout generation, dissemination, and training. Experimental results on GRPO post-training of 4B and 8B models demonstrate that ECHO-2 achieves significant cost efficiency improvements while maintaining comparable RL reward performance.
Introduces ECHO-2, a distributed RL framework that optimizes cost efficiency in LLM post-training by overlapping rollout generation, dissemination, and training, and managing policy staleness.
The paper introduces SLIME (Stabilized Likelihood Implicit Margin Enforcement), a novel reference-free alignment objective for preference optimization in LLMs that addresses the objective mismatch in existing methods like DPO. SLIME decouples preference learning from generation quality by incorporating an anchoring term to maximize the likelihood of preferred responses, a stabilizing penalty to prevent rejected token probabilities from collapsing, and a dual-margin mechanism for boundary shaping. Experiments demonstrate that SLIME outperforms state-of-the-art baselines while maintaining higher generation stability, mitigating issues like unlearning and formatting collapse.
Introduces a novel reference-free alignment objective, SLIME, that decouples preference learning from generation quality by stabilizing likelihoods and enforcing dual margins.
This paper investigates the impact of different explanation styles in AI-driven security dashboards on user trust, decision accuracy, and cognitive load. The authors conducted a mixed-methods study with security practitioners, comparing natural language rationales, confidence visualizations, counterfactual explanations, and hybrid approaches. Results demonstrate that explanation style significantly affects user trust calibration, decision accuracy, and cognitive load, leading to design guidelines for integrating explainability into enterprise UIs.
Empirically demonstrates the impact of various explanation styles on security analysts' trust, decision-making, and cognitive load within AI-enhanced UI security interfaces.
This paper addresses reward overoptimization in Reinforcement Learning from Human Feedback (RLHF) by introducing Real-Time Aligned Reward Model (R2M). R2M enhances reward models by incorporating real-time feedback from the evolving hidden states of the policy model, going beyond reliance on surface semantic information. The approach mitigates reward discrepancy caused by policy distribution shifts during RL, leading to improved alignment between the reward model and policy model.
Introduces R2M, a novel RLHF framework that aligns the reward model with the real-time distribution shift of the policy by leveraging the evolving hidden states of the policy model.
This paper addresses the problem of spurious correlations in reward models used in Reinforcement Learning from Human Feedback (RLHF) by proposing a factored representation learning framework. The framework decomposes contextual embeddings into causal factors sufficient for reward prediction and non-causal factors capturing reward-irrelevant attributes, constraining the reward head to depend only on the causal component. Experiments on mathematical and dialogue tasks demonstrate improved robustness and downstream RLHF performance compared to baselines, with analyses showing mitigation of reward hacking behaviors like exploiting length and sycophantic bias.
Introduces a factored representation learning framework that decomposes contextual embeddings into causal and non-causal factors to improve the robustness of reward models in RLHF.
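A hedged sketch of the factored reward head: the context embedding is projected into a "causal" factor that feeds the reward and a "non-causal" factor that is modeled separately but kept out of the reward path. Dimensions and the auxiliary objective that actually separates the two factors are omitted.

```python
import torch
import torch.nn as nn

class FactoredRewardHead(nn.Module):
    def __init__(self, hidden_dim: int, factor_dim: int = 256):
        super().__init__()
        self.causal_proj = nn.Linear(hidden_dim, factor_dim)
        self.noncausal_proj = nn.Linear(hidden_dim, factor_dim)   # e.g. length/style attributes
        self.reward_head = nn.Linear(factor_dim, 1)

    def forward(self, h: torch.Tensor):
        z_c = self.causal_proj(h)
        z_s = self.noncausal_proj(h)                  # trained with a separate auxiliary loss
        reward = self.reward_head(z_c).squeeze(-1)    # reward depends on the causal factor only
        return reward, z_c, z_s

model = FactoredRewardHead(hidden_dim=4096)
r, z_c, z_s = model(torch.randn(2, 4096))
```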
This paper introduces a human-centered AI framework for modeling consumer aesthetic perceptions by integrating subjective evaluations with domain-specific and computer vision-based features. The framework jointly models human-derived (consumer and designer) and machine-extracted features to link model outcomes to interpretable design features. The authors demonstrate how perceptual features, design patterns, and consumer interpretations contribute to aesthetic evaluations, enabling better understanding and anticipation of consumer taste.
Introduces a novel human-centered computational framework that explicitly links subjective aesthetic evaluations to interpretable design features through the joint modeling of human-derived and machine-extracted features.
The authors introduce WeatherQA, a new multimodal reasoning benchmark for meteorology, and Logically Consistent Reinforcement Fine-Tuning (LoCo-RFT) to address the issue of self-contradictory reasoning in VLMs. LoCo-RFT incorporates a logical consistency reward to ensure the model's reasoning aligns with its final answer, crucial for high-stakes domains like meteorology. The resulting model, Weather-R1, achieves a 9.8 percentage point improvement over the baseline on WeatherQA, surpassing supervised fine-tuning, standard RFT, and even the original Qwen2.5-VL-32B.
Introduces Logically Consistent Reinforcement Fine-Tuning (LoCo-RFT) to mitigate self-contradictory reasoning in vision-language models by incorporating a logical consistency reward signal.
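The general shape of a logical-consistency reward is easy to illustrate, though the answer parser and weights below are placeholders rather than the paper's design: full credit requires not just a correct final answer but agreement between the conclusion stated in the reasoning and the reported answer.

```python
import re

def consistency_reward(reasoning: str, final_answer: str, gold: str) -> float:
    m = re.findall(r"answer is\s*([A-D])", reasoning, flags=re.IGNORECASE)
    concluded = m[-1].upper() if m else None                  # last conclusion drawn in the reasoning
    correct = float(final_answer.strip().upper() == gold.upper())
    consistent = float(concluded is not None and concluded == final_answer.strip().upper())
    return 0.8 * correct + 0.2 * consistent                   # weights are assumptions
```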
The paper introduces Heart2Mind, a Contestable AI (CAI) system for psychiatric disorder prediction using wearable ECG data, designed to allow clinicians to inspect and revise algorithmic outputs. The system employs a Multi-Scale Temporal-Frequency Transformer (MSTFT) to analyze R-R intervals from ECG sensors, combining time and frequency domain features. Results on the HRV-ACC dataset show MSTFT achieves 91.7% accuracy, and human-centered evaluation demonstrates that experts and the CAI system can effectively collaborate to confirm correct decisions and correct errors through dialogue.
Introduces a contestable AI system, Heart2Mind, that integrates a multi-scale temporal-frequency transformer with self-adversarial explanations and a collaborative chatbot to enable clinicians to scrutinize and refine psychiatric disorder predictions based on wearable ECG data.
This study investigated whether training individuals on the strategic use of evidence (SUE) interview technique using large language model (LLM)-based AI suspects improves their ability to detect deception in subsequent interviews with human mock suspects. Participants were trained with either instruction alone or instruction combined with AI suspect simulations, and the results showed that both training groups used evidence-statement inconsistencies more effectively in their judgments compared to a control group. Furthermore, the group trained with AI suspects demonstrated better accuracy in judging the veracity of human mock suspects, suggesting a potential benefit of AI-enhanced training for SUE.
Demonstrates that training individuals on strategic use of evidence with LLM-based AI suspects can improve their ability to detect deception in human interviews, although the advantage over instruction-only training was limited.
This paper addresses the gap between HCAI policy ideals and their practical application in performance management by proposing the Integrated HCAI Performance & Development Model. The model integrates AI-powered analytics with human-centered interpretation, continuous feedback loops, and a strategic HR policy foundation to create a more ethical and developmental performance management process. The key result is a four-component framework designed to align organizational policies with technology-enhanced practices.
Introduces a novel Integrated HCAI Performance & Development Model that bridges the gap between AI-driven analytics and human-centered management in performance evaluation and employee development.
This paper proposes a 3-tier competency framework designed to equip clinicians with essential AI skills for the effective and responsible integration of large language models in clinical practice. The framework spans foundational skills for safe use, intermediate skills for evaluation, and advanced skills for ethical governance and model lifecycle management. The authors argue that integrating this framework into medical education and job descriptions will standardize AI deployment, promote safer clinical practice, and ultimately improve patient outcomes.
Introduces a tiered competency framework to guide clinicians in acquiring the necessary skills for responsible and effective use of AI in clinical settings.
This paper investigates whether a unimodal language model can provide effective feedback to tune a multimodal vision-language model (VLM). They propose a method where a language agent provides feedback to a VLM to adapt text generation according to the agent's preferences. Experiments demonstrate that LLM preference feedback enhances VLM descriptions, leading to improvements of up to 13% in absolute accuracy and a 64.6% preference alignment rate with human judgments.
Demonstrates that a unimodal language model can effectively provide preference feedback to tune a multimodal vision-language model, improving its descriptive accuracy and alignment with human preferences.
This study evaluated the readability and quality of patient education materials (PEMs) generated by five AI chatbots (ChatGPT, Microsoft Copilot, Google Gemini, Perplexity, and Claude AI) in response to questions about familial adenomatous polyposis (FAP). The PEMs exhibited above-average quality as measured by DISCERN and PEMAT scores, but demonstrated poor readability, with a mean reading grade level of 12.44, significantly exceeding the recommended level for patient education. These findings suggest that while AI chatbots can provide valuable information, adjustments are needed to improve the accessibility of AI-generated PEMs for patients with varying literacy levels.
Reveals that AI chatbots generate patient education materials on familial adenomatous polyposis with acceptable quality but poor readability, highlighting a need for improved accessibility.
The paper introduces EvoMDT, a self-evolving multi-agent system designed to improve structured clinical decision-making in multi-cancer multidisciplinary tumor boards (MDTs). EvoMDT uses a self-evolution loop to dynamically update prompts, consensus weights, and retrieval scope based on expert feedback and outcome signals, enhancing robustness and traceability. Evaluated on oncology QA benchmarks and real-world datasets, EvoMDT outperformed LLM baselines, achieving higher guideline concordance, semantic alignment with expert plans, and comparable decision quality to human MDTs with reduced response time.
Introduces a self-evolving multi-agent system, EvoMDT, that adaptively refines its decision-making process for cancer treatment recommendations based on expert feedback and outcome signals.
The paper introduces Factuality-aware Direct Preference Optimization (F-DPO), an extension of DPO designed to mitigate hallucinations in LLMs by incorporating binary factuality labels into the preference learning process. F-DPO addresses the issue of preference alignment methods reinforcing hallucinations by applying a label-flipping transformation to correct misordered preference pairs and adding a factuality-aware margin to emphasize pairs with clear correctness differences. Experiments across seven open-weight LLMs (1B-14B) demonstrate that F-DPO significantly improves factuality and reduces hallucination rates compared to both base models and standard DPO, while also generalizing to out-of-distribution benchmarks like TruthfulQA.
Introduces F-DPO, a novel and efficient method for reducing hallucinations in LLMs by integrating binary factuality labels into the DPO framework through label-flipping and factuality-aware margins.
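A hedged sketch of the two described ingredients on top of a DPO-style loss (the margin size and exact flipping rule are assumptions): swap chosen/rejected when the factuality labels disagree with the original ordering, and add a margin when the chosen response is factual and the rejected one is not.

```python
import torch
import torch.nn.functional as F

def f_dpo_loss(lp_c, lp_r, ref_c, ref_r, fact_c, fact_r, beta=0.1, gamma=1.0):
    """lp_*: policy log-probs; ref_*: reference log-probs; fact_*: 1.0 if factual else 0.0."""
    flip = fact_r > fact_c                                    # rejected is factual, chosen is not
    lp_c, lp_r = torch.where(flip, lp_r, lp_c), torch.where(flip, lp_c, lp_r)
    ref_c, ref_r = torch.where(flip, ref_r, ref_c), torch.where(flip, ref_c, ref_r)
    fact_c, fact_r = torch.where(flip, fact_r, fact_c), torch.where(flip, fact_c, fact_r)
    margin = gamma * (fact_c - fact_r).clamp(min=0)           # emphasize clear-cut pairs
    logits = beta * ((lp_c - lp_r) - (ref_c - ref_r)) - margin
    return -F.logsigmoid(logits).mean()
```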
The paper introduces Yuan3.0 Flash, an open-source 40B-parameter MoE multimodal LLM with 3.7B activated parameters, optimized for enterprise applications. To mitigate overthinking in large reasoning models, the authors propose Reflection-aware Adaptive Policy Optimization (RAPO), a novel RL training algorithm. Yuan3.0 Flash achieves superior performance on enterprise tasks like RAG and table understanding, while maintaining competitive general-purpose reasoning with significantly fewer tokens compared to frontier models.
Introduces Reflection-aware Adaptive Policy Optimization (RAPO), a novel RL training algorithm to regulate overthinking behaviors in large reasoning models.
This study examines HealthConnect's replacement of 90% of its human workforce with AI in healthcare call centers, assessing the balance between efficiency and ethical considerations. It employs a rapid literature review methodology using qualitative approaches to analyze the benefits and risks of AI adoption, focusing on workforce reduction, algorithmic bias, and patient trust. The key finding is that while AI increases efficiency in routine tasks, it also introduces risks of care prioritization disparities and transparency gaps, necessitating ethical frameworks and structured change management.
Demonstrates the necessity of ethical frameworks like human-centered AI and structured change management models to mitigate risks and ensure responsible AI implementation in healthcare call centers.
This paper provides a theoretical unification of preference learning methods for aligning LLMs, demonstrating that methods like RLHF, DPO, IPO, KTO, and SimPO can be understood through three orthogonal axes: preference model, regularization mechanism, and data distribution. It formalizes these axes with definitions and theorems, revealing the coverage separation between online and offline methods, scaling laws for reward overoptimization, and failure conditions for direct alignment. The analysis identifies how specific design choices lead to failure modes like length hacking and mode collapse, and it synthesizes empirical findings into a practitioner's decision guide.
Establishes a unifying theoretical framework for preference learning methods by identifying and formalizing three key orthogonal axes: preference model, regularization mechanism, and data distribution.
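As a concrete instance of those axes, the standard DPO objective pairs a Bradley-Terry preference model with KL-style regularization toward a reference policy, evaluated on an offline preference dataset $\mathcal{D}$ of (prompt, chosen, rejected) triples:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta;\pi_{\text{ref}}) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]$$

Varying the preference model, the regularizer, or the sampling distribution of $\mathcal{D}$ along the paper's three axes recovers the other methods it analyzes.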
This paper introduces a framework for Intercultural Human-Centred AI that integrates Automotive SPICE (ASPICE) practices with intelligent human systems integration to address challenges in safety-critical automotive systems. The framework uses structured AI-driven assessments with explainable decision layers to improve consistency and auditability, incorporates design principles for intercultural user interface design, and positions intelligent assistant systems as partners to human assessors. Results from a prototype deployed to 12 domain experts processing 424 queries demonstrated high perceived usefulness and strong adoption intent, suggesting the framework's potential for enhancing human-AI collaboration in regulated industries.
Introduces a novel framework integrating ASPICE processes with human-centered AI to improve consistency, cultural inclusivity, and human-AI collaboration in safety-critical automotive systems.
This qualitative study explores the perspectives of 25 educators in the Philippines on the integration of AI in education, focusing on its impact on teacher roles, ethics, and pedagogical value. The study identifies key themes including the perception of AI as an instructional support tool, the reaffirmation of irreplaceable human dimensions in teaching, systemic barriers to AI adoption, and ethical concerns. The findings emphasize the need for a teacher-centric integration framework that prioritizes infrastructure, professional development, and ethical safeguards.
Provides culturally grounded insights into educators' perspectives on AI in Philippine education, highlighting the importance of human-centered AI integration.
This paper evaluates the clinical performance of five large language models (LLMs) in complex cardiac surgery scenarios using a blinded two-phase evaluation by senior surgeons. The study found that while a reasoning-optimized proprietary LLM (O1) performed best, all models exhibited deficits in patient safety, hallucination avoidance, and clinical efficiency. A key finding was the "overacceptance" failure mode, where clinicians initially failed to identify flawed model outputs, suggesting that over-reliance on LLMs could pose significant risks in clinical decision-making.
Reveals a critical human-AI collaboration failure mode of "overacceptance" in cardiac surgery, where clinicians initially miss flawed LLM outputs, highlighting potential risks beyond simple model inaccuracy.
The paper introduces GRADE, a novel method for aligning LLMs with human preferences that replaces policy gradient methods with direct backpropagation. GRADE utilizes the Gumbel-Softmax reparameterization with straight-through estimation (GRADE-STE) to enable end-to-end gradient flow from reward signals through generated tokens to model parameters. Experiments on sentiment-controlled text generation using the IMDB dataset demonstrate that GRADE-STE achieves a 50% relative improvement over PPO, exhibits significantly lower gradient variance, and maintains stable training dynamics.
Introduces GRADE, a method that replaces high-variance policy gradient estimation with direct backpropagation through a differentiable relaxation of the discrete token sampling process for LLM alignment.
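The building block the method names is standard and sketched below: sample a hard one-hot token in the forward pass while letting gradients flow through the soft relaxation, so a differentiable reward computed on sampled tokens can be backpropagated into the model. How GRADE wires this into its full training loop is not shown here.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_st(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-10) + 1e-10)
    soft = F.softmax((logits + gumbel) / tau, dim=-1)
    hard = F.one_hot(soft.argmax(dim=-1), logits.size(-1)).to(soft.dtype)
    return hard + (soft - soft.detach())      # hard one-hot forward, soft gradient backward

logits = torch.randn(4, 32000, requires_grad=True)     # per-position vocabulary logits
tokens = gumbel_softmax_st(logits)                      # differentiable "sampled" tokens
```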
The paper introduces InfTool, a multi-agent framework comprising a User Simulator, Tool-Calling Assistant, and MCP Server, designed to autonomously generate tool-use trajectories from raw API specifications. InfTool closes the loop by training a model using Group Relative Policy Optimization (GRPO) with gated rewards on the synthesized data, iteratively improving the model's ability to generate higher-quality training data. Experiments on the Berkeley Function-Calling Leaderboard (BFCL) show that InfTool significantly improves a 32B model's accuracy from 19.8% to 70.9%, surpassing larger models and rivaling Claude-Opus, using only synthetic data.
Introduces a fully autonomous, self-evolving multi-agent framework, InfTool, for synthesizing diverse and verified tool-use trajectories, eliminating the need for human annotation and enabling significant performance gains in tool-calling accuracy.

