Feb 25, 2026 · arXiv:2602.22146

Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-Dual

AI Summary

This paper introduces a universal primal-dual framework for safe RLHF that unifies a broad class of existing alignment algorithms and addresses the instability of standard primal-dual methods under policy parameterization. The authors propose an optimistic primal-dual (OPD) algorithm that uses predictive updates for both primal and dual variables to stabilize saddle-point dynamics. The key result is a last-iterate convergence guarantee for OPD: exact convergence in the distributional policy space, and convergence to a neighborhood of the optimal solution under parameterized policies, with a gap related to approximation error and bias. The analysis highlights the role of optimism in mitigating the oscillations inherent to constrained alignment objectives.
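For orientation, the primal-dual problem referred to here can be written in a commonly used KL-regularized form. This is a hedged sketch and not the paper's own notation: the reward r, the constraint signals g_j with thresholds b_j, the regularization weight \beta, and the reference policy \pi_ref are all assumed symbols introduced only for illustration.

```latex
% Constrained alignment objective (assumed notation, not taken from the paper)
\max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)
\quad \text{s.t.} \quad
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ g_j(x, y) \big] \ge b_j,
  \qquad j = 1, \dots, m.

% Lagrangian saddle-point form with dual variables \lambda \ge 0
\min_{\lambda \ge 0}\; \max_{\pi}\;
  L(\pi, \lambda)
  = \mathbb{E}\Big[ r(x, y) + \sum_{j=1}^{m} \lambda_j\, g_j(x, y) \Big]
  - \beta\, \mathrm{KL}\big(\pi \,\|\, \pi_{\mathrm{ref}}\big)
  - \sum_{j=1}^{m} \lambda_j\, b_j .
```

Under these assumptions, L is concave in \pi when \pi ranges over distributions and linear in \lambda, which is the convex-concave structure that standard primal-dual guarantees rely on. Once \pi is replaced by a parameterized policy, that structure is generally lost.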

Key Contribution

Optimism is the key to stable and convergent safe RLHF, according to a new primal-dual framework that unifies existing alignment algorithms.

Abstract

Reinforcement Learning from Human Feedback (RLHF) plays a significant role in aligning Large Language Models (LLMs) with human preferences. While RLHF with expected reward constraints can be formulated as a primal-dual optimization problem, standard primal-dual methods only guarantee convergence with a distributional policy where the saddle-point problem is in convex-concave form. Moreover, standard primal-dual methods may exhibit instability or divergence in the last iterate under policy parameterization in practical applications. In this work, we propose a universal primal-dual framework for safe RLHF that unifies a broad class of existing alignment algorithms, including safe-RLHF, one-shot, and multi-shot based methods. Building on this framework, we introduce an optimistic primal-dual (OPD) algorithm that incorporates predictive updates for both primal and dual variables to stabilize saddle-point dynamics. We establish last-iterate convergence guarantees for the proposed method, covering both exact policy optimization in the distributional space and convergence to a neighborhood of the optimal solution whose gap is related to approximation error and bias under parameterized policies. Our analysis reveals that optimism plays a crucial role in mitigating oscillations inherent to constrained alignment objectives, thereby closing a key theoretical gap between constrained RL and practical RLHF.
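The stabilizing effect of optimism can be illustrated on a toy problem. The sketch below is not the paper's OPD algorithm and involves no LLM policy: it assumes the scalar bilinear saddle-point problem min over x, max over y of f(x, y) = x * y, whose unique saddle point is (0, 0), and compares plain gradient descent-ascent with an optimistic variant that steps along the predicted gradient 2 * g_t - g_{t-1}. The function names, step size, and iteration count are illustrative choices.

```python
import math


def grad(x, y):
    """Gradients of the toy objective f(x, y) = x * y, as (df/dx, df/dy)."""
    return y, x


def run_gda(x, y, lr=0.1, steps=2000):
    """Plain simultaneous gradient descent-ascent; returns the last iterate's
    distance to the saddle point (0, 0)."""
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y + lr * gy
    return math.hypot(x, y)


def run_optimistic(x, y, lr=0.1, steps=2000):
    """Optimistic descent-ascent: each player steps along 2 * g_t - g_{t-1},
    i.e. the current gradient plus a prediction of how it will change."""
    prev_gx, prev_gy = grad(x, y)  # makes the first step a plain GDA step
    for _ in range(steps):
        gx, gy = grad(x, y)
        x = x - lr * (2 * gx - prev_gx)  # primal (min) player
        y = y + lr * (2 * gy - prev_gy)  # dual (max) player
        prev_gx, prev_gy = gx, gy
    return math.hypot(x, y)


if __name__ == "__main__":
    print("plain GDA, distance to saddle: ", run_gda(1.0, 1.0))
    print("optimistic, distance to saddle:", run_optimistic(1.0, 1.0))
```

On this toy problem the plain iterates spiral outward, since each step multiplies the distance to the saddle point by sqrt(1 + lr^2), while the optimistic iterates contract toward (0, 0). This is the kind of last-iterate behavior the abstract attributes to optimism, shown here only in a setting far simpler than constrained RLHF.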

RLHF & Preference Learning · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
