Purdue | Feb 24, 2026 | arXiv:2602.20457

Oracle-Robust Online Alignment for Large Language Models

AI Summary

This paper addresses the challenge of online alignment of large language models (LLMs) when preference feedback is misspecified, meaning the observed oracle deviates from the true underlying oracle. The authors formulate an oracle-robust online alignment objective using a pointwise oracle uncertainty set and worst-case optimization. They show that, for log-linear policies, this robust objective decomposes into the original loss function plus a sensitivity penalty, and they develop projected stochastic composite updates that achieve $\widetilde{O}(\varepsilon^{-2})$ oracle complexity for approximate stationarity.

Key Contribution

Even with noisy or misspecified preference feedback, LLMs can be robustly aligned online by penalizing sensitivity to oracle uncertainty.

Abstract

We study online alignment of large language models under misspecified preference feedback, where the observed preference oracle deviates from an ideal but unknown ground-truth oracle. The online LLM alignment problem is a bi-level reinforcement problem due to the coupling between data collection and policy updates. Recently, the problem has been reduced to a tractable single-level objective in the SAIL (Self-Improving Efficient Online Alignment) framework. In this paper, we introduce a pointwise oracle uncertainty set in this problem and formulate an oracle-robust online alignment objective as a worst-case optimization problem. For log-linear policies, we show that this robust objective admits an exact closed-form decomposition into the original loss function plus an explicit sensitivity penalty. We develop projected stochastic composite updates for the resulting weakly convex objective and prove $\widetilde{O}(\varepsilon^{-2})$ oracle complexity for reaching approximate stationarity.
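The "loss plus penalty" structure and the projected stochastic updates mentioned above can be illustrated with a generic toy sketch. This is not the paper's algorithm: the logistic loss, the L2-norm penalty standing in for the sensitivity penalty, the ball constraint, the step-size schedule, and all function names are assumptions chosen purely for illustration. It shows a plain projected stochastic subgradient method, which takes a noisy subgradient step on the penalized objective and then projects back onto the feasible set.

```python
import math
import random


def project_onto_ball(theta, radius):
    """Euclidean projection of theta onto the L2 ball of the given radius."""
    norm = math.sqrt(sum(t * t for t in theta))
    if norm <= radius:
        return list(theta)
    scale = radius / norm
    return [t * scale for t in theta]


def logistic_loss_grad(theta, x, y):
    """Gradient of log(1 + exp(-y * <theta, x>)) in theta, with y in {-1, +1}."""
    margin = y * sum(t * xi for t, xi in zip(theta, x))
    # Numerically stable evaluation of sigmoid(-margin).
    if margin >= 0:
        e = math.exp(-margin)
        weight = e / (1.0 + e)
    else:
        weight = 1.0 / (1.0 + math.exp(margin))
    return [-y * weight * xi for xi in x]


def penalty_subgrad(theta):
    """A subgradient of the stand-in penalty ||theta||_2 (zero vector at the origin)."""
    norm = math.sqrt(sum(t * t for t in theta))
    if norm == 0.0:
        return [0.0] * len(theta)
    return [t / norm for t in theta]


def projected_stochastic_subgradient(samples, dim, radius=5.0, lam=0.1,
                                     step=0.5, iters=2000, seed=0):
    """Minimize E[loss] + lam * penalty over an L2 ball (toy stand-in objective)."""
    rng = random.Random(seed)
    theta = [0.0] * dim
    for k in range(1, iters + 1):
        x, y = rng.choice(samples)          # one stochastic sample per step
        g = logistic_loss_grad(theta, x, y)
        s = penalty_subgrad(theta)
        eta = step / math.sqrt(k)           # diminishing step size (an assumption)
        theta = [t - eta * (gi + lam * si) for t, gi, si in zip(theta, g, s)]
        theta = project_onto_ball(theta, radius)
    return theta


# Toy usage: every sample favors a positive first and negative second coordinate.
samples = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, -1.0], 1), ([-1.0, 1.0], -1)]
theta = projected_stochastic_subgradient(samples, dim=2)
```

In this sketch the penalty weight `lam` plays the role of a robustness knob: larger values shrink the parameters more aggressively, trading fit to the observed labels for lower sensitivity.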

RLHF & Preference Learning | Scalable Oversight & Alignment Theory

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
