This paper introduces PathRouter, a training framework for agentic Graph Retrieval-Augmented Generation (GraphRAG) that addresses two problems of outcome-only reinforcement learning: answer-path reward aliasing and search-update ambiguity. PathRouter scores each trajectory on both answer correctness and evidence-path overlap, which discourages shortcut reinforcement while preserving evidence-seeking behavior. On six QA benchmarks across three model sizes, it improves answer F1 and evidence-path overlap, with average F1 gains of 3.1 on 3B and 4.9 on 7B models over a strong baseline.
PathRouter reduces reliance on shortcuts during reinforcement learning, so language-model agents reach their answers through retrieved evidence paths more often.
Agentic GraphRAG trains language-model agents to iteratively retrieve and reason over graph-structured evidence, enabling more accurate and context-aware decision-making by efficiently navigating complex information networks. However, outcome-only reinforcement learning suffers from answer-path reward aliasing, where correct answers may come from shortcuts rather than useful evidence paths. It also exhibits search-update ambiguity, as scalar trajectory-level feedback does not indicate which retrieval actions to adjust. To mitigate these shortcomings, we present PathRouter, a path-aware training framework for agentic GraphRAG. PathRouter jointly evaluates each trajectory along answer correctness and evidence-path overlap, yielding four trajectory categories with differentiated GRPO advantage scaling that suppresses shortcut reinforcement while preserving evidence-seeking behavior. For evidence-poor trajectories, a frozen gold-evidence teacher provides token-level KL guidance on reasoning and search-query tokens, excluding answer tokens to avoid direct response imitation. Experiments on six QA benchmarks across three model sizes show that PathRouter consistently improves answer F1 and evidence-path overlap, achieving average F1 gains of 3.1 on 3B and 4.9 on 7B models compared to a strong baseline.
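The two mechanisms described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives neither the overlap threshold, the per-category scale factors, nor the KL direction, so `OVERLAP_THRESHOLD`, `CATEGORY_SCALE`, the teacher-to-student KL direction, and all function names here are assumptions chosen only to make the idea concrete.

```python
import math
from statistics import mean, pstdev

# Assumed values for illustration only; the abstract does not state them.
OVERLAP_THRESHOLD = 0.5
CATEGORY_SCALE = {
    (True, True): 1.0,    # correct answer, evidence-rich path: full positive credit
    (True, False): 0.3,   # correct answer, evidence-poor path: likely shortcut, damped
    (False, True): 0.5,   # wrong answer, evidence-rich path: softened penalty
    (False, False): 1.0,  # wrong answer, evidence-poor path: full penalty
}


def categorize(correct, path_overlap, threshold=OVERLAP_THRESHOLD):
    """Map a trajectory to one of four (answer correct, evidence rich) categories."""
    return (bool(correct), path_overlap >= threshold)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: reward standardized within the sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def scaled_advantages(trajectories):
    """Scale each group-relative advantage by its category factor."""
    rewards = [1.0 if t["correct"] else 0.0 for t in trajectories]
    base = group_relative_advantages(rewards)
    return [
        a * CATEGORY_SCALE[categorize(t["correct"], t["path_overlap"])]
        for a, t in zip(base, trajectories)
    ]


def masked_token_kl(teacher_probs, student_probs, is_answer_token):
    """Mean KL(teacher || student) over non-answer tokens.

    Each entry of teacher_probs / student_probs is one token position's
    distribution over the vocabulary. Student probabilities are assumed to be
    strictly positive, as they would be after a softmax.
    """
    total, count = 0.0, 0
    for p_teacher, p_student, is_answer in zip(
        teacher_probs, student_probs, is_answer_token
    ):
        if is_answer:
            continue  # answer tokens are excluded from teacher guidance
        total += sum(
            pt * math.log(pt / ps)
            for pt, ps in zip(p_teacher, p_student)
            if pt > 0
        )
        count += 1
    return total / count if count else 0.0


# One sampled group containing a trajectory from each category.
group = [
    {"correct": True, "path_overlap": 0.8},
    {"correct": True, "path_overlap": 0.1},
    {"correct": False, "path_overlap": 0.7},
    {"correct": False, "path_overlap": 0.0},
]
advantages = scaled_advantages(group)
```

Under these assumed factors, the shortcut trajectory (correct answer, low overlap) still receives a positive advantage, but a smaller one than the evidence-backed correct trajectory. That is one plausible reading of "suppresses shortcut reinforcement while preserving evidence-seeking behavior".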