AWS | Mar 15, 2026 | arXiv:2603.14628

s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

Balaji Rao, John Harrison, Soonho Kong, Juneyoung Lee, Carlo Lipizzi

AI Summary

The authors introduce s2n-bignum-bench, a new benchmark for evaluating LLMs' ability to generate machine-checkable proofs for low-level cryptographic assembly routines, using the HOL Light theorem prover. This benchmark leverages formally verified assembly routines from the s2n-bignum library used at AWS, providing a more realistic and challenging testbed compared to competition-style mathematics benchmarks. The task involves generating HOL Light proof scripts from formal specifications of the assembly routines.

Key Contribution

LLMs can now be tested on their ability to formally verify real-world cryptographic assembly code, not just competition math problems.

Abstract

Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, s2n-bignum-bench is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available at https://github.com/kings-crown/s2n-bignum-bench.
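The acceptance criterion described in the abstract (a proof script counts only if the checker accepts it within a fixed timeout) can be illustrated with a minimal sketch. This is not the benchmark's actual harness: the function name `check_proof`, the helper `stand_in`, the three verdict labels, and the convention that the checker exits with status 0 on acceptance are all assumptions made for illustration. A small Python subprocess stands in for the HOL Light invocation, whose real command line is not given in the source.

```python
import subprocess
import sys


def check_proof(checker_cmd, timeout_s):
    """Run a proof-checking command and classify the outcome.

    checker_cmd is a hypothetical command line assumed to exit with
    status 0 when the proof script is accepted; the real benchmark
    interface may differ.
    """
    try:
        result = subprocess.run(
            checker_cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed if this expires
        )
    except subprocess.TimeoutExpired:
        # A proof that does not finish checking in time is not accepted.
        return "timeout"
    return "accepted" if result.returncode == 0 else "rejected"


def stand_in(snippet):
    """Build a stand-in checker command (hypothetical, for illustration only)."""
    return [sys.executable, "-c", snippet]
```

Under these assumptions, `check_proof(stand_in("pass"), 30)` models an accepted proof, while a command that exits with a nonzero status or runs past the limit models a rejected or timed-out one. Treating a timeout as a distinct verdict keeps slow proofs separate from incorrect ones when results are tallied.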

Code Generation & Program Synthesis, Eval Frameworks & Benchmarks, Reasoning & Chain-of-Thought

Year: 2026

Venue: N/A
