The paper introduces SAFE, a dynamic benchmarking framework for multi-hop reasoning in LLMs that replaces ungrounded Chain-of-Thought reasoning with verifiable sequences of grounded entities. SAFE works in two phases. At train time, it removes noisy supervision from standard benchmarks using a KG-grounded verification pipeline and an atomic error taxonomy, identifying up to 14% of instances as unanswerable. At inference time, a feedback model trained on the verified data dynamically detects and corrects ungrounded reasoning steps. Experiments show that SAFE exposes flaws in existing benchmarks and improves average accuracy by 8.4 percentage points over standard baselines while keeping reasoning trajectories verifiable.
Explicitly grounding each reasoning step in a knowledge graph and dynamically correcting errors during inference gives LLMs verifiable reasoning paths and, in the paper's experiments, an average accuracy gain of 8.4 pp over standard baselines.
Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces ungrounded Chain-of-Thought (CoT) reasoning with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference time.
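The idea of checking a reasoning trajectory as a sequence of grounded entities can be illustrated with a minimal sketch. This is not the paper's pipeline: the toy knowledge graph, the `(head, relation, tail)` step format, and the function names `first_ungrounded_step` and `filter_verified` are all assumptions made for illustration, and the simple edge lookup stands in for both the KG-grounded verification and the learned feedback model.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy knowledge graph (illustrative assumption, not from the paper):
# (head entity, relation) -> set of valid tail entities.
KG: Dict[Tuple[str, str], Set[str]] = {
    ("Inception", "directed_by"): {"Christopher Nolan"},
    ("Christopher Nolan", "born_in"): {"London"},
    ("London", "capital_of"): {"United Kingdom"},
}

# One reasoning hop, written as (head, relation, tail). Hypothetical format.
Step = Tuple[str, str, str]


def first_ungrounded_step(trajectory: List[Step]) -> Optional[int]:
    """Return the index of the first step that fails verification, or None."""
    prev_tail: Optional[str] = None
    for i, (head, relation, tail) in enumerate(trajectory):
        # A hop must start from the entity the previous hop ended on.
        if prev_tail is not None and head != prev_tail:
            return i
        # The hop itself must correspond to an edge in the knowledge graph.
        if tail not in KG.get((head, relation), set()):
            return i
        prev_tail = tail
    return None


def filter_verified(dataset: List[List[Step]]) -> List[List[Step]]:
    """Train-time style filter: keep only fully grounded trajectories."""
    return [t for t in dataset if first_ungrounded_step(t) is None]


grounded = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
]
ungrounded = [
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "Paris"),  # no such edge in the toy KG
]

print(first_ungrounded_step(grounded))
print(first_ungrounded_step(ungrounded))
print(len(filter_verified([grounded, ungrounded])))
```

In this sketch, the same check plays both roles described in the abstract: applied to a dataset it filters out unverifiable supervision, and applied step by step during generation it would flag the hop at which a trajectory leaves the graph.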