Mar 11, 2026 · arXiv:2603.11226

ExecVerify: White-Box RL with Verifiable Stepwise Rewards for Code Execution Reasoning

Lingxiao Tang, He Ye, Zhaoyang Chu, Muyang Ye, Zhongxin Liu, Xiaoxue Ren, Lingfeng Bao

AI Summary

ExecVerify addresses the limitations of supervised fine-tuning for code execution reasoning in LLMs by incorporating verifiable white-box rewards derived from execution traces, namely next-statement prediction and variable value/type prediction. The authors construct a multi-difficulty dataset using constraint-based program synthesis and apply reinforcement learning to reward correct answers about intermediate execution steps and final outputs. Results show that a 7B model trained with ExecVerify achieves performance comparable to 32B models on code reasoning benchmarks and improves pass@1 by up to 5.9% on code generation tasks.

Key Contribution

A 7B model, guided by verifiable execution rewards, can now rival the code reasoning of models more than four times its size.

Abstract

Code LLMs still struggle with code execution reasoning, especially in smaller models. Existing methods rely on supervised fine-tuning (SFT) with teacher-generated explanations, primarily in two forms: (1) input-output (I/O) prediction chains and (2) natural-language descriptions of execution traces. However, intermediate execution steps cannot be explicitly verified during SFT, so the training objective can reduce to merely matching teacher explanations. Moreover, training data is typically collected without explicit control over task difficulty. We introduce ExecVerify, which goes beyond text imitation by incorporating verifiable white-box rewards derived from execution traces, including next-statement prediction and variable value/type prediction. Our work first builds a dataset with multiple difficulty levels via constraint-based program synthesis. Then, we apply reinforcement learning (RL) to reward correct answers about both intermediate execution steps and final outputs, aligning the training objective with semantic correctness at each execution step. Finally, we adopt a two-stage training pipeline that first enhances execution reasoning and then transfers to code generation. Experiments demonstrate that a 7B model trained with ExecVerify achieves performance comparable to 32B models on code reasoning benchmarks and improves pass@1 by up to 5.9% on code generation tasks over strong post-training baselines.
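To make the idea of verifiable white-box rewards concrete, the following is a minimal sketch of how next-statement and variable value/type checks could be derived from an execution trace. It is not the authors' implementation: the function names (`trace_execution`, `next_statement_reward`, `variable_reward`, `output_reward`), the binary 0/1 scoring, and the toy program `sample` are all assumptions made for illustration. Only the standard-library `sys.settrace` hook is used.

```python
import sys

_MISSING = object()  # sentinel for "variable not defined at this step"


def trace_execution(func, *args):
    """Run func(*args); record (line offset, locals snapshot) before each executed line."""
    code = func.__code__
    steps = []

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace into other functions
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            # Shallow snapshot: adequate for immutable values such as ints and strings.
            steps.append((offset, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)
    return steps, result


def next_statement_reward(steps, k, predicted_offset):
    """1.0 if the prediction names the line executed right after step k (assumed 0/1 scoring)."""
    if k + 1 >= len(steps):
        return 0.0
    return 1.0 if steps[k + 1][0] == predicted_offset else 0.0


def variable_reward(steps, k, name, predicted_value, predicted_type):
    """1.0 if both the value and the type name of `name` after step k are predicted correctly."""
    if k + 1 >= len(steps):
        return 0.0
    actual = steps[k + 1][1].get(name, _MISSING)
    if actual is _MISSING:
        return 0.0
    ok = actual == predicted_value and type(actual).__name__ == predicted_type
    return 1.0 if ok else 0.0


def output_reward(result, predicted_result):
    """1.0 if the predicted final output equals the real one."""
    return 1.0 if result == predicted_result else 0.0


# Hypothetical toy program used only to exercise the sketch.
def sample(n):
    total = 0
    for i in range(n):
        total += i
    return total


steps, result = trace_execution(sample, 2)
```

Because every reward is computed by comparing a model answer against the recorded trace, each intermediate question can be checked mechanically, which is the property the abstract contrasts with imitating teacher-written explanations.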

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 33

Year: 2026

Venue: N/A
