The paper introduces VCoT-Bench, a benchmark for evaluating LLMs' ability to reason like automated theorem provers in the context of Rust verification, using a novel framework called VCoT-Lift to expose solver-level reasoning as human-readable verification steps. This allows for fine-grained evaluation beyond simple pass/fail metrics by assessing robustness, competence, and sensitivity across different proof scenarios. Experiments on ten state-of-the-art LLMs demonstrate significant weaknesses, suggesting that current models are far from matching the reasoning capabilities of automated theorem provers for Rust verification.
LLMs cannot yet reason their way through Rust verification: they struggle to complete proofs even when given substantial hints, revealing a critical gap in their ability to meet the rigorous demands of secure software development.
As Large Language Models (LLMs) increasingly assist secure software development, their ability to meet the rigorous demands of Rust program verification remains unclear. Existing evaluations treat Rust verification as a black box, assessing models only by binary pass/fail outcomes on proof hints. This obscures whether models truly understand the logical deductions required for verifying nontrivial Rust code. To bridge this gap, we introduce VCoT-Lift, a framework that lifts low-level solver reasoning into high-level, human-readable verification steps. By exposing solver-level reasoning as an explicit Verification Chain-of-Thought, VCoT-Lift provides a concrete ground truth for fine-grained evaluation. Leveraging VCoT-Lift, we introduce VCoT-Bench, a comprehensive benchmark of 1,988 VCoT completion tasks for rigorously evaluating LLMs' understanding of the entire verification process. VCoT-Bench measures performance along three orthogonal dimensions: robustness to varying degrees of missing proofs, competence across different proof types, and sensitivity to proof locations. Evaluation of ten state-of-the-art models reveals severe fragility, indicating that current LLMs fall well short of the reasoning capabilities exhibited by automated theorem provers.
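To make the idea of a human-readable verification chain-of-thought concrete, here is a minimal illustrative sketch (not the paper's actual benchmark format or verifier syntax, which may differ): a simple Rust function whose correctness argument is spelled out as explicit deduction steps in comments, with each step checked at runtime via assertions. In a real verifier these steps would be loop invariants and postconditions discharged by a solver; a VCoT completion task would ask a model to fill in a missing step.

```rust
// Illustrative sketch only: the verification steps a solver would
// discharge are written out as comments and mirrored by runtime checks.

/// Sums 0..n. Claimed postcondition: result == n * (n - 1) / 2.
fn sum_below(n: u64) -> u64 {
    let mut total: u64 = 0;
    let mut i: u64 = 0;
    while i < n {
        // Step 1 (loop invariant on entry): total == i * (i - 1) / 2.
        // saturating_sub avoids u64 underflow in the i == 0 case.
        debug_assert_eq!(total, i * i.saturating_sub(1) / 2);
        total += i;
        i += 1;
        // Step 2 (invariant preserved): total == i * (i - 1) / 2.
        debug_assert_eq!(total, i * (i - 1) / 2);
    }
    // Step 3 (loop exit): i == n, so the invariant yields the
    // postcondition total == n * (n - 1) / 2.
    total
}

fn main() {
    assert_eq!(sum_below(0), 0);
    assert_eq!(sum_below(5), 10);
    assert_eq!(sum_below(100), 4950);
}
```

Removing, say, Step 2 and asking a model to restate the preserved invariant is the flavor of completion task the benchmark's three dimensions (degree of missing proof, proof type, proof location) would vary.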