ASU, Jilin, NUDT, UCF, UNC, Vienna

May 25, 2026

arXiv:2605.25603

Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy

Zhen Tan, Song Wang, Pingjun Hong, Rui Miao, Xin Wang

AI Summary

This paper introduces Circuit-guided Internal-External Discrepancy Scorer (CIE-Scorer), a novel framework for detecting unfaithful chain-of-thought reasoning by comparing internal computation graphs with external reasoning traces. CIE-Scorer efficiently traces sentence-level circuits from key reasoning tokens and uses Fused Gromov-Wasserstein distance to quantify the discrepancy between internal and external reasoning graphs. Experiments on FaithCoT-Bench demonstrate state-of-the-art performance in unfaithfulness detection with reduced circuit construction costs.

Key Contribution

Spotting unfaithful reasoning in LLMs just got easier: a new method efficiently compares a model's internal computations against its stated rationale.

Abstract

Chain-of-thought (CoT) reasoning improves the problem-solving ability of large language models (LLMs), but generated reasoning traces may not faithfully reflect the model's actual decision process. Existing CoT unfaithfulness detectors mainly rely on external signals from generated rationales, such as textual plausibility or answer consistency, while overlooking evidence from the model's internal computation. Although recent circuit tracing methods provide a way to obtain model-internal evidence by tracing how information flows through model components during reasoning, constructing full reasoning circuits for long CoTs is costly and difficult to scale. To address these challenges, we propose Circuit-guided Internal-External Discrepancy Scorer (CIE-Scorer), a framework for instance-level CoT unfaithfulness detection. The key idea is that faithful reasoning traces should align with the model's computational process, whereas unfaithful traces may diverge from it. CIE-Scorer efficiently traces compact sentence-level circuits from informative reasoning tokens, constructs internal and external reasoning graphs, and measures their discrepancy using Fused Gromov-Wasserstein distance. Experiments on four datasets from FaithCoT-Bench show that CIE-Scorer achieves state-of-the-art performance while reducing the cost of circuit construction, demonstrating the effectiveness of combining mechanistic interpretability signals with external reasoning traces for CoT unfaithfulness detection.
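To make the discrepancy idea in the abstract concrete, the following is a minimal sketch of a Fused Gromov-Wasserstein style score between two tiny graphs. It is not the paper's implementation: the function names, the toy graphs, the 0/1 feature cost, and `alpha = 0.5` are all assumptions made for illustration. The sketch evaluates the standard FGW objective (feature cost plus squared structure mismatch) and, instead of solving the relaxed optimal transport problem, searches only over permutation couplings, which gives an upper bound on the true FGW value for equal-sized graphs with uniform node weights.

```python
from itertools import permutations


def fgw_objective(M, C1, C2, T, alpha=0.5):
    """FGW objective for a fixed coupling T.

    M  : n x m feature cost between nodes of the two graphs
    C1 : n x n structure matrix of graph 1 (here: adjacency)
    C2 : m x m structure matrix of graph 2
    T  : n x m coupling (transport plan)
    """
    n, m = len(C1), len(C2)
    feature = sum(M[i][j] * T[i][j] for i in range(n) for j in range(m))
    structure = sum(
        (C1[i][k] - C2[j][l]) ** 2 * T[i][j] * T[k][l]
        for i in range(n) for j in range(m)
        for k in range(n) for l in range(m)
    )
    return (1 - alpha) * feature + alpha * structure


def best_permutation_fgw(M, C1, C2, alpha=0.5):
    """Upper bound on FGW: brute force over permutation couplings.

    Assumes both graphs have the same number of nodes and uniform
    node weights 1/n. Only practical for very small graphs.
    """
    n = len(C1)
    best = float("inf")
    for perm in permutations(range(n)):
        T = [[1.0 / n if perm[i] == j else 0.0 for j in range(n)]
             for i in range(n)]
        best = min(best, fgw_objective(M, C1, C2, T, alpha))
    return best


# Hypothetical "internal" graph: a 3-step chain 0 - 1 - 2.
internal = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
# Hypothetical "external" graphs built from the written rationale.
external_aligned = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]    # same chain
external_divergent = [[0, 0, 1], [0, 0, 0], [1, 0, 0]]  # step 1 disconnected

# Toy feature cost: 0 if node labels match, 1 otherwise.
M = [[0 if i == j else 1 for j in range(3)] for i in range(3)]

score_aligned = best_permutation_fgw(M, internal, external_aligned)
score_divergent = best_permutation_fgw(M, internal, external_divergent)
print(score_aligned, score_divergent)
```

In this toy setting the aligned pair scores zero and the divergent pair scores strictly higher, which mirrors the intuition that a rationale matching the internal computation should yield a small discrepancy. A practical version would use an optimal transport solver rather than brute force.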

Interpretability & Mechanistic Interp

Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
