The paper introduces C2-Faith, a benchmark derived from PRM800K, to evaluate LLM judges on causal and coverage faithfulness in chain-of-thought (CoT) reasoning. The benchmark uses controlled perturbations to create examples with known causal error positions and controlled coverage deletions. Experiments with three frontier LLM judges show that judge rankings depend strongly on task framing, that judges struggle to localize errors even when they detect them, and that they systematically overestimate the quality of incomplete reasoning.
LLM judges of chain-of-thought reasoning can be easily fooled: they struggle to pinpoint causal errors and consistently overestimate the quality of incomplete reasoning.
Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no single judge dominating all settings; all judges exhibit a substantial gap between detecting an error and localizing it; and coverage judgments are systematically inflated for incomplete reasoning. These findings clarify when LLM judges are dependable and where they fail, and provide practical guidance for selecting judges in process-level evaluation.
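The two perturbations and the detection-versus-localization distinction described above can be illustrated with a minimal Python sketch. It is not the authors' code: the names `PerturbedChain`, `inject_acausal_step`, `delete_steps`, and `detection_and_localization` are hypothetical, the acausal variant is assumed to be supplied by the caller, and the scoring rule (exact-match localization on examples that contain an error) is an assumption rather than the paper's stated metric.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PerturbedChain:
    """A reasoning chain plus the ground truth created by the perturbation."""
    steps: List[str]
    error_index: Optional[int] = None  # position of the acausal step, if any
    deleted_indices: List[int] = field(default_factory=list)  # removed original positions


def inject_acausal_step(steps: List[str], acausal_variant: str,
                        rng: random.Random) -> PerturbedChain:
    """Replace exactly one step with a caller-supplied acausal variant."""
    idx = rng.randrange(len(steps))
    perturbed = list(steps)
    perturbed[idx] = acausal_variant
    return PerturbedChain(steps=perturbed, error_index=idx)


def delete_steps(steps: List[str], rate: float,
                 rng: random.Random) -> PerturbedChain:
    """Drop roughly `rate` of the steps, always keeping at least one."""
    k = min(round(rate * len(steps)), len(steps) - 1)
    deleted = sorted(rng.sample(range(len(steps)), k))
    removed = set(deleted)
    kept = [s for i, s in enumerate(steps) if i not in removed]
    return PerturbedChain(steps=kept, deleted_indices=deleted)


def detection_and_localization(
    preds: List[Tuple[bool, Optional[int]]],
    gold: List[Optional[int]],
) -> Tuple[float, float]:
    """Return (detection accuracy, localization accuracy).

    Each prediction is (flagged_as_erroneous, predicted_step_index).
    Localization is scored only on examples that contain an error
    (an assumed scoring rule, used here for illustration).
    """
    detect_hits = sum(flag == (g is not None) for (flag, _), g in zip(preds, gold))
    with_error = [(p, g) for p, g in zip(preds, gold) if g is not None]
    loc_hits = sum(flag and idx == g for (flag, idx), g in with_error)
    detection = detect_hits / len(gold)
    localization = loc_hits / len(with_error) if with_error else 0.0
    return detection, localization
```

Passing a seeded `random.Random` instance keeps the perturbed examples reproducible. A judge that flags every erroneous chain but points at the wrong step gets a high detection score and a low localization score under this rule, which is the kind of gap the abstract reports.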