This paper introduces Model-Bench, a benchmark of 400 Python programs derived from HumanEval, MBPP, and LiveCodeBench, to evaluate the ability of LLMs to model programs into formal specifications suitable for model checking. Experiments using Model-Bench reveal significant limitations in LLMs' program modeling capabilities, highlighting a gap in their ability to translate code into formal verification-ready specifications. The authors also provide insights into potential future research directions for improving LLMs in this domain.
LLMs struggle to translate code into formal specifications, as evidenced by their poor performance on the new Model-Bench benchmark, revealing a critical gap in their ability to support formal verification.
In the digital age, ensuring the correctness, safety, and reliability of software through formal verification is paramount, particularly as software increasingly underpins critical infrastructure. Formal verification comprises two main approaches, theorem proving and model checking, and offers a rigorous path to such guarantees. While theorem proving has seen notable advances, model checking has received less attention, largely because automatically modeling programs into checkable specifications is difficult. To fill this gap, we introduce Model-Bench, a benchmark and accompanying pipeline for evaluating and improving LLMs' program modeling capability: the task of translating Python programs into verification-ready model-checking specifications checkable by the benchmark's accompanying model checker. Model-Bench comprises 400 Python programs derived from three well-known benchmarks (HumanEval, MBPP, and LiveCodeBench). Our extensive experiments reveal significant limitations in LLMs' program modeling and point to promising directions for future research.
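To make the program-modeling task concrete, the following is a minimal, hypothetical sketch of what such a translation involves: a benchmark-style Python function is paired with a formal property and an exhaustive bounded check, mimicking what an explicit-state model checker would explore. The function names, the spec format (plain Python assertions over a bounded domain), and the bounds are illustrative assumptions only; Model-Bench's actual specification language and model checker are defined in the paper.

```python
# Hypothetical illustration of the program-modeling task Model-Bench evaluates.
# The spec format below is an assumption for illustration; it is not the
# paper's actual specification language or model checker.

from itertools import product

def max_pairwise_sum(xs: list[int]) -> int:
    """Example benchmark-style program: largest sum of two distinct elements."""
    best = xs[0] + xs[1]
    for i in range(len(xs)):
        for j in range(len(xs)):
            if i != j:
                best = max(best, xs[i] + xs[j])
    return best

def spec_holds(xs: list[int]) -> bool:
    """A model of the intended behavior: the result equals the true maximum
    over all ordered pairs of distinct indices."""
    result = max_pairwise_sum(xs)
    pairs = [xs[i] + xs[j] for i in range(len(xs)) for j in range(len(xs)) if i != j]
    return result == max(pairs)

def bounded_model_check(max_len: int = 3, domain: range = range(-2, 3)) -> bool:
    """Exhaustively check the spec over a small bounded state space,
    in the spirit of an explicit-state model checker."""
    for n in range(2, max_len + 1):
        for xs in product(domain, repeat=n):
            if not spec_holds(list(xs)):
                print(f"Counterexample: {list(xs)}")
                return False
    return True

if __name__ == "__main__":
    print("Specification holds on bounded domain:", bounded_model_check())
```

The benchmark then measures whether an LLM can produce such a verification-ready model of a program automatically, rather than relying on a human to write the specification by hand.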