Stanford HAI | Feb 17, 2026 | arXiv:2602.15338

Discovering Implicit Large Language Model Alignment Objectives

Edward Chen, Sanmi Koyejo, Carlos Guestrin

AI Summary

The paper introduces Obj-Disco, a framework to automatically decompose LLM alignment reward signals into sparse, weighted combinations of human-interpretable natural language objectives. It addresses limitations of existing interpretation methods by iteratively analyzing behavioral changes across training checkpoints using a greedy algorithm to identify objectives that explain the residual reward signal. Evaluations across diverse tasks and models demonstrate Obj-Disco captures >90% of reward behavior and identifies latent misaligned incentives.

Key Contribution

Uncover hidden incentives in your reward model: Obj-Disco automatically decomposes alignment rewards into human-interpretable objectives, revealing potential misalignments you might have missed.

Abstract

Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized, creating critical risks of misalignment and reward hacking. Existing interpretation methods typically rely on pre-defined rubrics, risking the omission of "unknown unknowns", or fail to identify objectives that comprehensively cover and are causal to the model behavior. To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives. Our approach utilizes an iterative greedy algorithm to analyze behavioral changes across training checkpoints, identifying and validating candidate objectives that best explain the residual reward signal. Extensive evaluations across diverse tasks, model sizes, and alignment algorithms demonstrate the framework's robustness. Experiments with popular open-source reward models show that the framework consistently captures > 90% of reward behavior, a finding further corroborated by human evaluation. Additionally, a case study on alignment with an open-source reward model reveals that Obj-Disco can successfully identify latent misaligned incentives that emerge alongside intended behaviors. Our work provides a crucial tool for uncovering the implicit objectives in LLM alignment, paving the way for more transparent and safer AI development.
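The greedy residual-fitting idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that candidate objectives have already been proposed and scored per sample (in the paper, candidates come from analyzing behavioral changes across training checkpoints, which is not modeled here), and it swaps in a plain forward-stagewise least-squares fit as the selection rule. The function name `greedy_decompose`, its parameters, and the objective names in the usage note are all hypothetical.

```python
from typing import Dict, List, Tuple


def _center(xs: List[float]) -> List[float]:
    """Subtract the mean so fits are not dominated by a constant offset."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


def _dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def greedy_decompose(
    reward: List[float],
    candidates: Dict[str, List[float]],
    max_objectives: int = 3,
    target_r2: float = 0.9,
) -> Tuple[List[Tuple[str, float]], float]:
    """Greedily pick objectives that best explain the residual reward.

    Assumption: `candidates` maps a hypothetical objective name to its
    per-sample scores, aligned with `reward`. Returns the selected
    (name, weight) pairs and the fraction of reward variance explained.
    """
    residual = _center(reward)
    total_ss = _dot(residual, residual)
    if total_ss == 0:
        raise ValueError("reward is constant; nothing to explain")

    centered = {name: _center(scores) for name, scores in candidates.items()}
    remaining = set(centered)
    selected: List[Tuple[str, float]] = []
    r2 = 0.0

    while remaining and len(selected) < max_objectives and r2 < target_r2:
        best_name, best_weight, best_gain = None, 0.0, 0.0
        for name in sorted(remaining):  # sorted for deterministic tie-breaks
            scores = centered[name]
            ss = _dot(scores, scores)
            if ss == 0:
                continue  # a constant objective cannot explain variation
            weight = _dot(scores, residual) / ss  # 1-D least-squares fit
            gain = weight * weight * ss  # drop in residual sum of squares
            if gain > best_gain:
                best_name, best_weight, best_gain = name, weight, gain
        if best_name is None:
            break  # no candidate reduces the residual any further

        selected.append((best_name, best_weight))
        remaining.remove(best_name)
        residual = [
            r - best_weight * s for r, s in zip(residual, centered[best_name])
        ]
        r2 = 1.0 - _dot(residual, residual) / total_ss

    return selected, r2
```

A call such as `greedy_decompose(reward, {"helpful": [...], "verbose": [...]}, target_r2=0.99)` returns the selected objectives in the order they were picked, together with the explained-variance fraction. Because earlier weights are never refit, the weights are only exact least-squares coefficients when the candidate scores are uncorrelated; a fuller version would refit all selected weights jointly after each step.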

Interpretability & Mechanistic Interp | RLHF & Preference Learning | Scalable Oversight & Alignment Theory

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
