May 5, 2026arXiv:2605.03356

POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference

AI Summary

The paper introduces POSTCONDBENCH, a new benchmark for evaluating the correctness and completeness of LLM-generated formal postconditions for Python and Java methods. The benchmark uses real-world code from open-source projects and assesses completeness by measuring the ability of postconditions to discriminate between correct and defective implementations. Experiments using POSTCONDBENCH reveal a significant gap between correctness and completeness in SOTA LLMs, particularly when dealing with complex methods and repository-level dependencies.

Key Contribution

LLMs can generate formally correct postconditions for code, but they often miss crucial details, especially in complex, real-world scenarios.

Abstract

Formal postconditions precisely characterize program behavior and support debugging, testing, and verification, but writing them requires substantial expertise and effort. This has motivated recent work on automatically generating postconditions from code and natural-language artifacts using large language models (LLMs). However, evaluation remains a key bottleneck. Existing benchmarks primarily emphasize correctness under limited evaluation settings, often relying on surface-form matching or manual assessment on small or synthetic datasets. We introduce POSTCONDBENCH, a multilingual benchmark for evaluating method-level postcondition generation from real-world software. POSTCONDBENCH comprises 420 Python and Java tasks drawn from 121 open-source projects, each paired with a high-quality ground-truth postcondition set constructed with expert involvement. To enable automatic evaluation, POSTCONDBENCH provides a runnable execution environment and operationalizes completeness via defect discrimination: a postcondition set is more complete if it is violated by more defective implementations while remaining satisfied on correct executions. Using POSTCONDBENCH, we formulate three generation settings and evaluate five SOTA LLMs. Our results reveal a substantial gap between correctness and completeness, and show that repository-level dependencies and method complexity exacerbate this gap.

Code Generation & Program Synthesis Eval Frameworks & Benchmarks

Citation Metrics

Citations0

Influential citations0

References57

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference

Related Papers