Purdue | Jun 8, 2026 | arXiv:2606.09635

Gradient-Guided Reward Optimization for Inference-time Alignment

AI Summary

This paper introduces Gradient-Guided Reward Optimization (GGRO), a method for inference-time alignment of Large Language Models (LLMs) that addresses the limitations of sampling-intensive techniques such as Best-of-$N$ and rejection sampling. GGRO monitors token-level entropy during decoding and, in high-uncertainty regions, injects nudging tokens generated from the gradient signals of a reward model. The reported experiments show improved alignment on safety, helpfulness, and reasoning benchmarks, along with greater robustness to reward hacking and minimal computational overhead.

Key Contribution

GGRO improves LLM inference-time alignment by steering the generation trajectory in high-uncertainty regions with reward-gradient-derived nudging tokens, rather than merely re-ranking samples.

Abstract

Ensuring the reliability of Large Language Models (LLMs) under distribution drift requires inference-time adaptation. While inference-time alignment methods such as Best-of-$N$ and rejection sampling are widely used, they frame the task as a sampling-intensive, reward-guided search, leading to two key limitations: their performance is bounded by the base model's generation quality, and their reliance on imperfect reward models makes them vulnerable to reward hacking. To address these challenges, we introduce Gradient-Guided Reward Optimization (GGRO), a lightweight inference-time method that performs targeted, minimal intervention during decoding via gradient guidance. Specifically, GGRO monitors token-level entropy to identify high-uncertainty regions indicative of drift or misalignment. Upon detection, it responds by injecting nudging tokens, generated using gradient signals from an off-the-shelf reward model, to steer the generation trajectory rather than merely re-ranking samples. Experiments show that GGRO consistently improves inference-time alignment across safety, helpfulness, and reasoning benchmarks. It also increases coverage of high-quality responses and robustness to reward hacking, with minimal computational overhead. Code is available at https://github.com/lhk2004/GGRO.
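The entropy-monitoring step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `token_entropy` and `high_uncertainty_steps`, the plain-list logits, and the fixed `threshold` are all assumptions made for illustration, and the abstract does not specify the actual trigger criterion. The gradient-based generation of nudging tokens is not shown.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_uncertainty_steps(step_logits, threshold):
    """Return indices of decoding steps whose entropy exceeds `threshold`.

    `threshold` is a hypothetical hyperparameter, not a value from the paper.
    """
    return [
        i for i, logits in enumerate(step_logits)
        if token_entropy(logits) > threshold
    ]


# Step 0 is sharply peaked (low entropy); step 1 is uniform (high entropy).
steps = [[10.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
flagged = high_uncertainty_steps(steps, threshold=1.0)
```

In this toy input only the uniform step is flagged, which is where a method of this kind would intervene; confident steps are left untouched, consistent with the "targeted, minimal intervention" framing above.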

RLHF & Preference Learning | Scalable Oversight & Alignment Theory

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
