This paper introduces Self-Induced Outcome Potential (SIOP), a turn-level credit assignment method for LLM agents that needs neither explicit turn-level rewards nor task-specific verifiers. SIOP clusters final answers into semantic outcome modes and rewards turns that increase the posterior support for reliable future states, extending information-potential shaping to settings without gold-answer supervision. On seven search-augmented agentic reasoning benchmarks, SIOP improves average performance over verifier-free outcome-level baselines and approaches a gold-supervised outcome baseline.
Finally, a way to train LLM agents to reason step-by-step without needing humans to check every intermediate thought.
Long-horizon LLM agents depend on intermediate information-gathering turns, yet training feedback is usually observed only at the final answer, because process-level rewards require high-quality human annotation. Existing turn-level shaping methods reward turns that increase the likelihood of a gold answer, but they require answer supervision or stable task-specific verifiers. Conversely, label-free RL methods extract self-signals from output distributions, but mainly at the answer or trajectory level and therefore cannot assign credit to intermediate turns. We propose Self-Induced Outcome Potential (SIOP), which treats semantic clusters of final answers as latent future outcome states for potential-based turn-level credit assignment. For each query, SIOP samples multiple rollouts, clusters final answers into semantic outcome modes, and builds a reliability-aware target distribution over these states. It then rewards turns for increasing posterior support for reliable future states using a tractable cluster-level approximation. The objective generalizes information-potential shaping from gold-answer supervision to settings without task-specific gold verifiers while avoiding the broadcasted rollout-level advantages used by standard GRPO. We formalize the framework, characterize its supervised gold-answer limit, and show that SIOP improves average performance over verifier-free outcome-level baselines on seven search-augmented agentic reasoning benchmarks while approaching a gold-supervised outcome baseline. Code is available at https://github.com/dl-m9/SIOP.git.
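The pipeline described in the abstract (sample rollouts, cluster final answers, build a target distribution over clusters, reward turns by the change in support for that target) can be illustrated with a toy sketch. This is not the authors' implementation; the linked repository holds the actual one. Everything named here is an assumption made for illustration: `normalize` stands in for semantic clustering by exact match on normalized strings, cluster frequency stands in for the reliability-aware weighting, the per-turn cluster posteriors are taken as given inputs rather than computed from a policy, and the potential is a plain weighted sum rather than whatever form the paper uses.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Hypothetical stand-in for semantic clustering: lowercase, collapse whitespace."""
    return " ".join(answer.lower().split())


def target_distribution(final_answers):
    """Map each answer cluster to its empirical frequency across rollouts.

    Assumption: frequency serves as a crude reliability proxy; the paper's
    reliability-aware target may be defined differently.
    """
    counts = Counter(normalize(a) for a in final_answers)
    total = sum(counts.values())
    return {cluster: n / total for cluster, n in counts.items()}


def potential(posterior, target):
    """Support of a cluster posterior for the target: sum_c target[c] * posterior[c]."""
    return sum(w * posterior.get(cluster, 0.0) for cluster, w in target.items())


def turn_rewards(posteriors, target):
    """Potential-difference reward per turn.

    posteriors[0] is the cluster posterior before any turn, and
    posteriors[t] the posterior after turn t (assumed to be supplied).
    """
    phis = [potential(p, target) for p in posteriors]
    return [phis[t + 1] - phis[t] for t in range(len(phis) - 1)]


# Usage: four rollouts for one query, then one trajectory with two turns.
answers = ["Paris", "paris", " Paris ", "Lyon"]
target = target_distribution(answers)
trajectory_posteriors = [
    {"paris": 0.5, "lyon": 0.5},  # before any search turn
    {"paris": 0.8, "lyon": 0.2},  # after an informative turn
    {"paris": 0.8, "lyon": 0.2},  # after an uninformative turn
]
print(turn_rewards(trajectory_posteriors, target))
```

In this sketch the rewards telescope, so their sum equals the final potential minus the initial one, and a turn that leaves the posterior unchanged earns zero. If the target is one-hot on a single gold cluster, the reward reduces to the change in posterior mass on that cluster, which mirrors the supervised gold-answer limit mentioned in the abstract only in spirit.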