This paper introduces Hindsight Credit Assignment Policy Optimization (HCAPO), a novel framework for improving LLM agent performance in long-horizon tasks with sparse rewards. HCAPO uses the LLM as a post-hoc critic to refine step-level Q-values through hindsight reasoning and employs a multi-scale advantage mechanism to correct inaccurate value baselines. Experiments on WebShop and ALFWorld show HCAPO outperforms state-of-the-art RL methods, achieving significant success rate improvements over GRPO with the Qwen2.5-7B-Instruct model.
LLM agents can learn to solve complex, long-horizon tasks much more effectively by using themselves as post-hoc critics to refine Q-values through hindsight reasoning.
Large Language Model (LLM) agents often face significant credit assignment challenges in long-horizon, multi-step tasks due to sparse rewards. Existing value-free methods, such as Group Relative Policy Optimization (GRPO), encounter two fundamental bottlenecks: inaccurate step-level Q-value estimation and misaligned value baselines for intermediate states. To address these limitations, we introduce HCAPO, the first framework to integrate hindsight credit assignment into LLM agents. HCAPO leverages the LLM itself as a post-hoc critic to refine step-level Q-values through hindsight reasoning. Furthermore, HCAPO's multi-scale advantage mechanism corrects the inaccurate value baselines at critical decision states. Evaluations across three challenging benchmarks, including WebShop and ALFWorld, demonstrate that HCAPO consistently outperforms state-of-the-art RL methods. Notably, HCAPO achieves success-rate improvements over GRPO of 7.7% on WebShop and 13.8% on ALFWorld using the Qwen2.5-7B-Instruct model. These results indicate that HCAPO significantly enhances exploration efficiency, promotes concise decision-making, and ensures scalability in complex, long-horizon tasks.
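The abstract's core idea, that a trajectory-level group advantage (as in GRPO) can be supplemented with step-level hindsight advantages, can be illustrated with a minimal sketch. This is not the paper's implementation: the blending weight, the hindsight Q-values (which HCAPO would obtain from the LLM critic), and the step baselines below are all hypothetical placeholders.

```python
import statistics

def grpo_advantages(returns):
    """GRPO-style group-relative advantage: normalize each trajectory's
    sparse terminal return against the group mean and std."""
    mean = statistics.mean(returns)
    std = statistics.pstdev(returns) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in returns]

def blended_step_advantages(traj_adv, hindsight_q, baseline_v, weight=0.5):
    """Blend the single trajectory-level advantage with step-level
    hindsight advantages (critic-estimated Q minus a value baseline),
    so intermediate steps receive differentiated credit."""
    return [
        (1 - weight) * traj_adv + weight * (q - v)
        for q, v in zip(hindsight_q, baseline_v)
    ]

# Toy example: a group of 3 rollouts with sparse terminal rewards.
returns = [1.0, 0.0, 0.0]
advs = grpo_advantages(returns)

# For the successful rollout, refine per-step credit using hypothetical
# hindsight Q-values (stand-ins for the LLM post-hoc critic's estimates).
step_advs = blended_step_advantages(
    advs[0],
    hindsight_q=[0.9, 0.4, 1.0],   # assumed critic outputs per step
    baseline_v=[0.5, 0.5, 0.5],    # assumed value baseline per step
)
```

Under plain GRPO every step in a rollout would share the same advantage `advs[0]`; the blended version assigns different credit to each step, which is the kind of step-level differentiation the abstract attributes to hindsight credit assignment.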