InfoPO addresses the challenge of training LLM agents to handle underspecified user requests by introducing an information-gain reward that credits interaction turns based on how much they reduce the agent's uncertainty about the correct action. This reward is computed by comparing the agent's action distribution with and without feedback from a specific turn, and then fused with task outcome rewards using a variance-gated mechanism. Experiments across intent clarification, collaborative coding, and tool-augmented decision making show that InfoPO outperforms prompting and multi-turn RL baselines, demonstrating improved robustness and generalization.
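The turn-level comparison can be sketched in code. The minimal Python illustration below assumes the "measurable change" between the agent's action distribution with and without a turn's feedback is quantified by KL divergence over a discrete action set; the metric choice, function name, and distributions are hypothetical stand-ins, not details taken from the paper:

```python
import math

def info_gain_reward(p_with_feedback, p_masked):
    """Hypothetical information-gain reward for one interaction turn.

    Computes the KL divergence D(p_with_feedback || p_masked) between
    the agent's action distribution after observing a turn's feedback
    and the masked-feedback counterfactual. A large value means the
    feedback substantially shifted the agent's beliefs about the
    correct action; zero means the turn was uninformative.
    (KL is one natural choice here; the paper's exact measure may differ.)
    """
    return sum(
        p * math.log(p / q)
        for p, q in zip(p_with_feedback, p_masked)
        if p > 0
    )
```

For example, a clarifying question that moves the action distribution from uniform `[0.5, 0.5]` to confident `[0.9, 0.1]` earns a positive reward, while a turn that leaves the distribution unchanged earns zero.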
LLM agents can learn to elicit crucial information from users by rewarding interaction turns that most reduce uncertainty about the optimal action, leading to better task performance.
Real-world user requests to LLM agents are often underspecified, so agents must interact to acquire missing information before making correct downstream decisions. However, current multi-turn GRPO-based methods typically rely on trajectory-level reward computation, which causes credit-assignment problems and yields insufficient advantage signals within rollout groups. A promising remedy is to identify valuable interaction turns at fine granularity and drive more targeted learning. To this end, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably shifts the agent's subsequent action distribution relative to a masked-feedback counterfactual. InfoPO then combines this signal with task outcomes via an adaptive variance-gated fusion, weighting informational value while preserving the task-oriented objective. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It also remains robust under user-simulator shifts and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
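One way to picture the variance-gated fusion is a sketch in which the information-gain signal is up-weighted when outcome rewards barely vary within a rollout group (the regime where GRPO advantages vanish) and attenuated when outcome variance already carries learning signal. The specific gate `1 / (1 + var)` and the additive combination below are illustrative assumptions, not the paper's formula:

```python
def fused_rewards(outcome_rewards, info_rewards):
    """Hypothetical variance-gated fusion of outcome and info-gain rewards.

    outcome_rewards: task-outcome rewards for each rollout in a group.
    info_rewards: per-rollout information-gain rewards (e.g. summed
    over that rollout's interaction turns).

    When outcome rewards in the group have near-zero variance, the gate
    approaches 1 and the information-gain signal dominates; under high
    outcome variance the gate shrinks and task outcomes drive learning.
    """
    n = len(outcome_rewards)
    mean = sum(outcome_rewards) / n
    var = sum((r - mean) ** 2 for r in outcome_rewards) / n
    gate = 1.0 / (1.0 + var)  # illustrative gating function
    return [o + gate * i for o, i in zip(outcome_rewards, info_rewards)]
```

For instance, with outcomes `[1.0, 1.0, 1.0]` (zero variance, hence zero GRPO advantage) the gate is 1 and the information-gain rewards alone differentiate the rollouts, restoring a usable advantage signal within the group.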