Search papers, labs, and topics across Lattice.
This paper introduces GraphPO, a novel reinforcement learning framework that optimizes reasoning models by representing rollouts as directed acyclic graphs, where reasoning steps are edges and semantic states are nodes. By merging semantically equivalent reasoning paths into equivalence classes, GraphPO minimizes redundant exploration and reallocates computational resources towards more diverse paths, addressing the limitations of traditional RLVR approaches. Experimental results demonstrate that GraphPO significantly reduces advantage-estimation variance and enhances reasoning efficiency, outperforming existing chain- and tree-based methods across various benchmarks.
GraphPO slashes redundancy in reasoning model training, enabling more efficient exploration and improved performance on complex tasks.
Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for enhancing the capability of large reasoning models. RLVR typically samples responses independently and optimizes the policy using from final answers. This paradigm has two limitations. First, independently responses often contain similar intermediate reasoning steps, causing redundant exploration and wasted computation. Second, sparse final-answer rewards make it hard to identify useful steps. Tree-based methods partly address this problem by sharing prefixes and comparing branches from the same prefix to provide fine-grained signals. However, tree branches are still expanded independently. When different branches reach similar reasoning states, they cannot share information and repeat similar exploration. Moreover, tree-based methods ignore such dispersion and only perform local comparisons within separate branches, which can lead to higher variance in advantage estimation. To address this challenge, we propose GraphPO (Graph-based Policy Optimization), a novel RL framework that represents rollouts as a directed acyclic graph, with reasoning steps as edges and semantic states summarized from the reasoning paths as nodes. GraphPO merges semantically equivalent reasoning paths into equivalence classes, allowing them to share suffixes and reallocating budget away from redundant expansions to diverse exploration. Furthermore, we assign efficiency advantages to incoming edges and correctness advantages to outgoing edges, thereby improving inference efficiency while deriving process supervision from outcome. Theory shows that GraphPO reduces advantage-estimation variance and enhances reasoning efficiency. Experiments on three LLMs across reasoning and agentic search benchmarks show that GraphPO consistently outperforms chain- and tree-based baselines with the same token budgets or response budgets.