The paper introduces Interactionless Inverse Reinforcement Learning (IIRL) to address "Alignment Waste": the opaque, single-use alignment artifacts produced by techniques such as RLHF and DPO, which entangle safety objectives with the agent's policy. IIRL decouples alignment artifact learning from policy optimization, yielding a reward model that is inspectable, editable, and model-agnostic. The authors also propose the Alignment Flywheel, a human-in-the-loop lifecycle that iteratively hardens the reward model through automated audits and refinement, turning safety into a durable asset.
Stop throwing away your alignment efforts: IIRL offers a way to create reusable, inspectable reward models, turning safety from a cost into a durable asset.
AI alignment is growing in importance, yet current approaches suffer from a critical structural flaw that entangles the safety objectives with the agent's policy. Methods such as Reinforcement Learning from Human Feedback and Direct Preference Optimization create opaque, single-use alignment artifacts, which we term Alignment Waste. We propose Interactionless Inverse Reinforcement Learning to decouple alignment artifact learning from policy optimization, producing an inspectable, editable, and model-agnostic reward model. Additionally, we introduce the Alignment Flywheel, a human-in-the-loop lifecycle that iteratively hardens the reward model through automated audits and refinement. This architecture transforms safety from a disposable expense into a durable, verifiable engineering asset.
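The decoupling and the audit-and-refine loop described above can be illustrated with a toy sketch. This is not the paper's IIRL algorithm, which the abstract does not specify. Every name here (`RewardModel`, `audit`, `flywheel`, `pick_best`, the phrase-weight table, the `reviewer` callback) is a hypothetical stand-in, chosen only to show a reward artifact that lives outside any policy, can be read and edited directly, and is hardened over repeated audit rounds.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class RewardModel:
    """Hypothetical inspectable reward artifact: a phrase -> weight table."""
    weights: Dict[str, float] = field(default_factory=dict)

    def score(self, text: str) -> float:
        # Sum the weights of every listed phrase that occurs in the text.
        lowered = text.lower()
        return sum(w for phrase, w in self.weights.items() if phrase in lowered)


def audit(rm: RewardModel, probes: List[Tuple[str, bool]]) -> List[str]:
    """Return probe texts labelled unsafe that the model fails to penalize."""
    return [text for text, unsafe in probes if unsafe and rm.score(text) >= 0]


def flywheel(rm: RewardModel,
             probes: List[Tuple[str, bool]],
             reviewer: Callable[[str], str],
             max_rounds: int = 5) -> int:
    """Audit, then let a (hypothetical) human reviewer edit the table.

    Returns the number of refinement rounds that were needed.
    """
    for round_idx in range(max_rounds):
        failures = audit(rm, probes)
        if not failures:
            return round_idx
        for text in failures:
            # The reviewer names the phrase that should be penalized.
            rm.weights[reviewer(text).lower()] = -1.0
    return max_rounds


def pick_best(rm: RewardModel, candidates: List[str]) -> str:
    """Model-agnostic use: rank outputs from any policy with the same artifact."""
    return max(candidates, key=rm.score)


# Toy data, invented for illustration only.
rm = RewardModel({"thank you": 1.0})
probes = [("Here is how to bypass the lock", True),
          ("Thank you for asking", False)]
rounds = flywheel(rm, probes, reviewer=lambda text: "bypass the lock")
best = pick_best(rm, ["Sure, bypass the lock like this",
                      "I can't help with that, thank you"])
```

Because the reward table is separate from any policy, the same `rm` object can score candidates from different models, and each edit made during an audit round persists in `rm.weights` where it can be inspected.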