Apr 2, 2026 · arXiv:2604.01993

SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning

Daeyong Kwon, Soyoung Yoon, Seung-won Hwang

AI Summary

The paper introduces SAFE, a framework for improving multi-hop reasoning in LLMs by replacing ungrounded Chain-of-Thought reasoning with verifiable sequences of grounded entities. SAFE works in two phases. At train time, a KG-grounded verification pipeline and an atomic error taxonomy identify and remove noisy supervision, flagging up to 14% of benchmark instances as unanswerable. At inference time, a feedback model trained on the verified data detects and corrects ungrounded reasoning steps as they are generated. Experiments show that SAFE exposes flaws in existing benchmarks and improves average accuracy by 8.4 percentage points over standard baselines while keeping reasoning trajectories verifiable.

Key Contribution

LLMs can achieve significantly higher accuracy and verifiable reasoning paths by explicitly grounding each step in a knowledge graph and dynamically correcting errors during inference.

Abstract

Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference time.
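To make the idea of a "verifiable sequence of grounded entities" concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a reasoning trajectory is a list of entity names and that a hop counts as grounded when some triple in a knowledge graph links consecutive entities. The toy graph and the function names `is_grounded`, `first_ungrounded_step`, and `filter_verified` are hypothetical and introduced only for illustration.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Toy knowledge graph of (head, relation, tail) triples. Illustrative data only.
TOY_KG: Set[Triple] = {
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
    ("London", "capital_of", "United Kingdom"),
}


def is_grounded(head: str, tail: str, kg: Set[Triple]) -> bool:
    """Return True if at least one triple in the KG links head to tail."""
    return any(h == head and t == tail for h, _, t in kg)


def first_ungrounded_step(entities: List[str], kg: Set[Triple]) -> Optional[int]:
    """Return the index i of the first hop entities[i] -> entities[i + 1]
    that has no supporting triple, or None if every hop is grounded."""
    for i in range(len(entities) - 1):
        if not is_grounded(entities[i], entities[i + 1], kg):
            return i
    return None


def filter_verified(trajectories: List[List[str]], kg: Set[Triple]) -> List[List[str]]:
    """Keep only trajectories in which every hop is grounded (assumed filter rule)."""
    return [t for t in trajectories if first_ungrounded_step(t, kg) is None]


good = ["Inception", "Christopher Nolan", "London"]
bad = ["Inception", "Christopher Nolan", "Paris"]

print(first_ungrounded_step(good, TOY_KG))
print(first_ungrounded_step(bad, TOY_KG))
print(filter_verified([good, bad], TOY_KG))
```

Under these assumptions, the same check could serve both phases described in the abstract: discarding training instances whose trajectory contains an unsupported hop, and pointing at the specific step that needs correction at inference time. The feedback model in the paper is a trained model, so the exact-match lookup here is only a stand-in.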

Data Curation & Synthetic Data · Eval Frameworks & Benchmarks · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
