The paper introduces Group Relative Policy Optimization (GRPO) with a multi-label reward regression model to address the challenge of aligning LLMs with human values and safety constraints while balancing conflicting objectives such as helpfulness and safety. GRPO optimizes the policy by comparing groups of sampled responses, eliminating the need for a separate value critic and improving training efficiency; a reward model predicts multiple alignment scores that are combined into a single reward signal. Empirical results on models ranging from 0.5B to 14B parameters show that GRPO improves safety and quality metrics, achieving better alignment at lower computational cost than PPO-based RLHF and DPO.
Forget RLHF's complexity: GRPO offers a simpler, cheaper, and more robust way to align LLMs across multiple objectives like safety and helpfulness.
Aligning large language models (LLMs) with human values and safety constraints is challenging, especially when objectives like helpfulness, truthfulness, and avoidance of harm conflict. Reinforcement Learning from Human Feedback (RLHF) has achieved notable success in steering models, but is complex and can be unstable. Recent approaches such as Direct Preference Optimization (DPO) simplify preference-based fine-tuning but may introduce bias or trade off certain objectives~\cite{dpo}. In this work, we propose a Group Relative Policy Optimization (GRPO) framework with a multi-label reward regression model to achieve safe and aligned language generation. The GRPO algorithm optimizes a policy by comparing groups of sampled responses, eliminating the need for a separate value critic and improving training efficiency~\cite{grpo}. We train a reward model to predict multiple alignment scores (e.g., safety and helpfulness), which are combined into a single reward signal. We provide a theoretical derivation for using this learned multi-aspect reward within GRPO and discuss its advantages and limitations. Empirically, our approach improves all evaluated safety and quality metrics in language generation tasks across model scales (0.5B, 7B, and 14B parameters), demonstrating a robust balance of objectives. We compare GRPO to PPO-based RLHF and DPO, highlighting that GRPO achieves alignment with significantly lower computational cost and explicit multi-objective handling. \textbf{We will open-source all trained models at https://huggingface.co/hydroxai.}
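The two core operations the abstract describes, collapsing multi-label reward scores into one scalar and computing critic-free, group-relative advantages, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the weighting scheme and the standardization details (e.g., the epsilon term) are assumptions.

```python
from statistics import mean, pstdev

def combined_reward(scores, weights):
    """Collapse multi-label alignment scores (e.g., safety, helpfulness)
    into a single scalar reward via a weighted sum (weights assumed)."""
    return sum(s * w for s, w in zip(scores, weights))

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: standardize each sampled response's
    reward against the mean and std of its own group, so no separate
    value critic is needed as a baseline."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, a group of 4 sampled responses, each scored on
# [safety, helpfulness] by the reward model (values illustrative):
group_scores = [[0.9, 0.4], [0.7, 0.8], [0.2, 0.9], [0.95, 0.1]]
rewards = [combined_reward(s, [0.5, 0.5]) for s in group_scores]
advantages = grpo_advantages(rewards)
```

Responses scoring above their group's mean combined reward receive positive advantages and are reinforced; those below are penalized, all relative to the group itself rather than a learned value function.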