The paper introduces CROP, a calibration procedure for language model reasoning traces that uses step-level risk proxies to provide statistical guarantees on the longest contiguous prefix of a trace that can be safely retained. CROP selects a calibrated threshold that controls the probability that the certified prefix contains an annotated error, and routes the uncertified suffix to downstream review or repair. Experiments across six process-labeled reasoning datasets show that CROP improves repair accuracy by preserving valid intermediate reasoning steps while discarding misleading suffixes, and that standard step-level metrics such as AUROC do not fully capture prefix utility.
Instead of all-or-nothing certification, CROP offers statistical guarantees on the *longest correct prefix* of a reasoning trace, enabling more effective error localization and repair.
Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees for the proportion of a sequential trace that can be safely retained. To address this, we introduce CROP (Conformal Reasoning Output Prefixes), a verifier-agnostic calibration procedure for clean-prefix certification. Given any step-level risk proxy, CROP selects a calibrated threshold and returns the longest contiguous prefix whose step risk proxies remain below it, routing the uncertified suffix for downstream review or repair. Assuming exchangeability, CROP rigorously controls the marginal probability that the returned prefix contains an annotated error. Across six process-labeled reasoning datasets, we demonstrate that standard step-level metrics such as AUROC do not fully capture prefix utility, suggesting verifiers should instead be evaluated by certified prefix length. Furthermore, CROP balances over- and under-withholding, improving downstream repair accuracy by preserving valid intermediate reasoning while discarding misleading suffixes. Ultimately, this work positions prefix certification as a rigorous, practical bridge between process supervision, abstention, and repair.
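The thresholding idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is a generic split-conformal construction written under stated assumptions. The function names (`failure_score`, `calibrate_threshold`, `certified_prefix_length`), the strict `risk < tau` comparison, and the representation of labels as the 0-based index of the first annotated error (or `None` for a clean trace) are all hypothetical choices made for illustration.

```python
import math
from typing import Optional, Sequence


def failure_score(risks: Sequence[float], first_error: Optional[int]) -> float:
    """Largest risk seen up to and including the first annotated error.

    Under the strict rule `risk < tau`, the returned prefix contains the
    error exactly when tau exceeds this value. Clean traces (first_error
    is None) can never fail, so they score +inf.
    """
    if first_error is None:
        return math.inf
    return max(risks[: first_error + 1])


def calibrate_threshold(
    cal_risks: Sequence[Sequence[float]],
    cal_first_errors: Sequence[Optional[int]],
    alpha: float,
) -> float:
    """Pick tau as the k-th smallest failure score, k = floor(alpha * (n + 1)).

    Assumption: calibration and test traces are exchangeable. Then the
    chance that a test trace's failure score falls strictly below tau is
    at most k / (n + 1) <= alpha.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    scores = sorted(
        failure_score(r, e) for r, e in zip(cal_risks, cal_first_errors)
    )
    k = math.floor(alpha * (len(scores) + 1))
    if k == 0:
        # Too little calibration data for this alpha: certify nothing.
        return -math.inf
    return scores[k - 1]


def certified_prefix_length(risks: Sequence[float], tau: float) -> int:
    """Length of the longest contiguous prefix whose risks stay below tau."""
    length = 0
    for r in risks:
        if not r < tau:
            break
        length += 1
    return length
```

In this sketch, steps from index `certified_prefix_length(risks, tau)` onward form the uncertified suffix that would be routed to review or repair. Note the conservative fallback: when `alpha * (n + 1) < 1`, the threshold is `-inf` and the certified prefix is empty.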