Mar 16, 2026 · arXiv:2603.15510

Not All Invariants Are Equal: Curating Training Data to Accelerate Program Verification with SLMs

Ido Pinto, Yizhak Yisrael Elboher, Haoze Wu, Nina Narodytska, Guy Katz

AI Summary

This paper addresses the challenge of generating high-quality training data for synthesizing inductive loop invariants, a bottleneck in automated program verification. The authors introduce Wonda, a data curation pipeline that refines noisy verifier-generated invariants using AST-based normalization, followed by LLM-driven semantic rewriting and augmentation with provable quality guarantees. Fine-tuning small language models (SLMs) on the curated dataset significantly improves performance: a 4B-parameter model matches the utility of a GPT-OSS-120B baseline and approaches GPT-5.2 performance on challenging instances from InvBench.
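To make "inductive loop invariant" concrete, the sketch below checks the three standard conditions (initiation, consecution, sufficiency) for a toy loop. Everything in it is an illustrative assumption rather than material from the paper: the loop (`x = 0; y = 0; while x < n: x += 1; y += 2`), the postcondition `y == 2 * n`, and the helper names `pre`, `guard`, `step`, `post`, and `check_invariant` are all hypothetical. The check enumerates a small bounded state space, so it can only refute a candidate; it is not a proof, and a real verifier would discharge these conditions symbolically.

```python
from itertools import product

# Toy loop (hypothetical, not from the paper):
#   x = 0; y = 0
#   while x < n: x += 1; y += 2
#   assert y == 2 * n

def pre(x, y, n):
    return x == 0 and y == 0 and n >= 0

def guard(x, y, n):
    return x < n

def step(x, y, n):
    return x + 1, y + 2, n

def post(x, y, n):
    return y == 2 * n

def check_invariant(inv, bound=6):
    """Bounded brute-force check of the three invariant conditions."""
    states = list(product(range(-bound, bound + 1), repeat=3))
    # Initiation: the invariant holds on entry to the loop.
    if any(pre(*s) and not inv(*s) for s in states):
        return "initiation fails"
    # Consecution: one loop iteration preserves the invariant.
    if any(inv(*s) and guard(*s) and not inv(*step(*s)) for s in states):
        return "consecution fails"
    # Sufficiency: invariant plus loop exit implies the postcondition.
    if any(inv(*s) and not guard(*s) and not post(*s) for s in states):
        return "sufficiency fails"
    return "ok"

print(check_invariant(lambda x, y, n: y == 2 * x and x <= n))
print(check_invariant(lambda x, y, n: y == 2 * x))
print(check_invariant(lambda x, y, n: y <= 2 * n))
```

The second candidate is preserved by the loop but is too weak to imply the postcondition, and the third is not preserved by an iteration. These are two different ways a generated invariant can be useless to a verifier.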

Key Contribution

Forget massive models: fine-tuning a small 4B-parameter model on carefully curated data can match the utility of a 120B-parameter baseline (GPT-OSS-120B) and approach GPT-5.2 on program verification tasks.

Abstract

The synthesis of inductive loop invariants is a critical bottleneck in automated program verification. While Large Language Models (LLMs) show promise in mitigating this issue, they often fail on hard instances, generating invariants that are invalid or computationally ineffective. Fine-tuning is a natural route to mitigate this limitation, but obtaining high-quality training data for invariant generation remains an open challenge. We present a rigorous data curation pipeline designed to extract high-quality training signals from raw verifier-generated invariants. First, we formalize the properties required for a high-quality training invariant. Second, we propose Wonda, a pipeline that refines noisy data via AST-based normalization, followed by LLM-driven semantic rewriting and augmentation with provable quality guarantees. We demonstrate that fine-tuning Small Language Models (SLMs) on this curated dataset results in consistent and significant performance gains. In particular, a fine-tuned 4B-parameter model matches the utility of a GPT-OSS-120B baseline and approaches the state-of-the-art GPT-5.2, without incurring reasoning-time overhead. On challenging instances from the recent InvBench evaluation suite, our approach doubles the invariant correctness and speedup rates of base models and improves their Virtual Best Performance (VBP) rates on the verification task by up to 14.2%.
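The abstract does not spell out what AST-based normalization does, so the following is only a rough sketch of the general idea, not Wonda's actual procedure. It assumes invariants are written as Python-syntax boolean expressions (a stand-in for whatever language the pipeline really targets) and applies three hypothetical rules: drop redundant parentheses, remove duplicate conjuncts, and put conjuncts in a canonical order. The helper names `conjuncts`, `render`, and `normalize` are invented for illustration, and `ast.unparse` requires Python 3.9 or later.

```python
import ast

def conjuncts(node):
    """Flatten nested `and` nodes into a flat list of conjunct expressions."""
    if isinstance(node, ast.BoolOp) and isinstance(node.op, ast.And):
        out = []
        for value in node.values:
            out.extend(conjuncts(value))
        return out
    return [node]

def render(node):
    """Print one conjunct, parenthesizing forms that bind looser than `and`."""
    text = ast.unparse(node)  # normalizes spacing and drops redundant parentheses
    if isinstance(node, (ast.BoolOp, ast.IfExp, ast.Lambda, ast.NamedExpr)):
        return f"({text})"
    return text

def normalize(invariant: str) -> str:
    """Return a canonical form: unique conjuncts, sorted, joined by `and`."""
    tree = ast.parse(invariant, mode="eval")
    parts = sorted({render(c) for c in conjuncts(tree.body)})
    return " and ".join(parts)

print(normalize("((x <= n)) and (y == 2*x) and (x <= n)"))
print(normalize("y == 2*x and (x<=n)"))
```

Both calls print the same string, so syntactic variants of one invariant collapse to a single training example. Purely syntactic rules like these cannot detect semantically equivalent conjuncts such as `x <= n` and `n >= x`, which is presumably where a semantic rewriting stage would come in.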

Code Generation & Program Synthesis · Data Curation & Synthetic Data · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
