The paper introduces VCoT-Bench, a benchmark for evaluating LLMs' ability to reason like automated theorem provers in the context of Rust verification, using a novel framework called VCoT-Lift to expose solver-level reasoning as human-readable verification steps. This allows for fine-grained evaluation beyond simple pass/fail metrics by assessing robustness, competence, and sensitivity across different proof scenarios. Experiments on ten state-of-the-art LLMs demonstrate significant weaknesses, suggesting that current models are far from matching the reasoning capabilities of automated theorem provers for Rust verification.
LLMs cannot yet reason their way through Rust verification: they struggle to complete proofs even when given substantial hints, revealing a critical gap in their ability to meet the rigorous demands of secure software development.
As Large Language Models (LLMs) increasingly assist secure software development, their ability to meet the rigorous demands of Rust program verification remains unclear. Existing evaluations treat Rust verification as a black box, assessing models only by binary pass/fail outcomes on proof hints. This obscures whether models truly understand the logical deductions required for verifying nontrivial Rust code. To bridge this gap, we introduce VCoT-Lift, a framework that lifts low-level solver reasoning into high-level, human-readable verification steps. By exposing solver-level reasoning as an explicit Verification Chain-of-Thought, VCoT-Lift provides a concrete ground truth for fine-grained evaluation. Leveraging VCoT-Lift, we introduce VCoT-Bench, a comprehensive benchmark of 1,988 VCoT completion tasks for rigorously evaluating LLMs' understanding of the entire verification process. VCoT-Bench measures performance along three orthogonal dimensions: robustness to varying degrees of missing proofs, competence across different proof types, and sensitivity to proof locations. Evaluation of ten state-of-the-art models reveals severe fragility, indicating that current LLMs fall well short of the reasoning capabilities exhibited by automated theorem provers.
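To make the idea of a human-readable verification chain-of-thought concrete, here is a minimal illustrative sketch (not the paper's actual benchmark format or verifier syntax, which may differ): a simple Rust function whose correctness argument is spelled out as explicit deduction steps in comments, with each step checked at runtime via assertions. In a real verifier these steps would be loop invariants and postconditions discharged by a solver; a VCoT completion task would ask a model to fill in a missing step.

```rust
// Illustrative sketch only: the verification steps a solver would
// discharge are written out as comments and mirrored by runtime checks.

/// Sums 0..n. Claimed postcondition: result == n * (n - 1) / 2.
fn sum_below(n: u64) -> u64 {
    let mut total: u64 = 0;
    let mut i: u64 = 0;
    while i < n {
        // Step 1 (loop invariant on entry): total == i * (i - 1) / 2.
        // saturating_sub avoids u64 underflow in the i == 0 case.
        debug_assert_eq!(total, i * i.saturating_sub(1) / 2);
        total += i;
        i += 1;
        // Step 2 (invariant preserved): total == i * (i - 1) / 2.
        debug_assert_eq!(total, i * (i - 1) / 2);
    }
    // Step 3 (loop exit): i == n, so the invariant yields the
    // postcondition total == n * (n - 1) / 2.
    total
}

fn main() {
    assert_eq!(sum_below(0), 0);
    assert_eq!(sum_below(5), 10);
    assert_eq!(sum_below(100), 4950);
}
```

Removing, say, Step 2 and asking a model to restate the preserved invariant is the flavor of completion task the benchmark's three dimensions (degree of missing proof, proof type, proof location) would vary.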