Cambridge · SJTU · Jun 8, 2026 · arXiv:2606.09577

Code Is More Than Text: Uncertainty Estimation for Code Generation

Yuling Shi, Caiqi Zhang, Yuexian Li, Haopeng Wang, Yeheng Chen, Nigel Collier, Xiaodong Gu

AI Summary

This paper addresses the need for reliable uncertainty estimation (UE) when large language models (LLMs) generate code, and highlights challenges specific to code such as token fragility and the intent-code gap. The authors introduce three orthogonal uncertainty axes (lexical, algorithmic, and functional) and show that an ensemble of the three improves UE across five code LLMs. The ensemble reaches an average AUROC of 0.776, against 0.696 for the strongest natural-language-derived baseline (reported as +8.1 points). Separately, the single-pass lexical signal, Top-K token entropy, matches the strongest multi-pass baseline on Qwen3-14B while being over 3x cheaper.
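As a rough illustration of the functional axis (behavioral consistency), the sketch below groups sampled candidate programs by how they behave on a shared set of inputs and reports the entropy over those groups. The listing does not give the paper's exact formulation, so the function name `behavioral_uncertainty`, the clustering by exact output match, and the use of entropy in nats are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Any, Callable, Sequence


def behavioral_uncertainty(
    candidates: Sequence[Callable[[Any], Any]],
    inputs: Sequence[Any],
) -> float:
    """Entropy (nats) over groups of candidates that behave identically.

    Hypothetical helper: candidates that return the same outputs (or raise the
    same exception types) on every input fall into the same group.
    """
    if not candidates:
        raise ValueError("need at least one candidate program")
    signatures = []
    for fn in candidates:
        outputs = []
        for x in inputs:
            try:
                outputs.append(("ok", repr(fn(x))))
            except Exception as exc:  # a crash is also observable behavior
                outputs.append(("error", type(exc).__name__))
        signatures.append(tuple(outputs))
    counts = Counter(signatures)
    n = len(candidates)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Three sampled "programs": two double the input, one squares it.
samples = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2]
print(behavioral_uncertainty(samples, inputs=[1, 2, 3]))
```

A score of zero means every sampled program behaved the same on the chosen inputs; higher values mean the samples split into more, or more evenly sized, behavioral groups. In practice, generated code would need to run in a sandbox rather than as in-process callables.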

Key Contribution

Code generation requires a unique approach to uncertainty estimation, as a single wrong token can disrupt an entire program's functionality.
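A toy illustration of token fragility, not taken from the paper: the two hypothetical functions below differ by a single token, both run without raising, and yet they disagree on every test case.

```python
def clamp(x: float, lo: float, hi: float) -> float:
    """Intended behavior: limit x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def clamp_one_token_off(x: float, lo: float, hi: float) -> float:
    # Identical to clamp except for one token: the outer `max` became `min`.
    return min(lo, min(x, hi))


# Inputs below, inside, and above the interval [0, 10].
cases = [(-5, 0, 10), (5, 0, 10), (15, 0, 10)]
agreement = sum(clamp(*c) == clamp_one_token_off(*c) for c in cases) / len(cases)
print(f"fraction of cases where the two versions agree: {agreement}")
```

The broken variant fails silently, which is the situation where an uncertainty score is most useful: nothing at runtime signals that the program is wrong.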

Abstract

Large language models (LLMs) are increasingly deployed as code generators, where silently wrong programs pose real safety and reliability risks. Reliable uncertainty estimation (UE) is essential for selective prediction, human-in-the-loop review, and downstream agentic decisions. Yet most existing code UE methods are inherited from natural language (NL) generation and ignore properties that make code distinct. We argue that code differs from NL in three ways: a single wrong token can break an entire program (token fragility); algorithmic intent and concrete implementation can disagree independently (intent-code gap); and programs can be executed (executability). We instantiate these properties as three orthogonal uncertainty axes: lexical (Top-K token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency). Across five code LLMs, our three-axis ensemble improves average AUROC from 0.696 for the strongest NL-derived baseline to 0.776 (+8.1 points). Notably, on Qwen3-14B, our single-pass Top-K token entropy matches the strongest multi-pass baseline while being over 3x cheaper; across models, it remains a competitive low-cost signal. These results suggest that code UE deserves code-specific design rather than direct NL ports.
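To make the lexical axis and the AUROC metric concrete, here is a minimal sketch. It assumes that Top-K token entropy means the entropy of the renormalized top-K candidate distribution at each generated token, averaged over the sequence; the abstract does not spell out the exact definition, so that choice, the aggregation by mean, and the function names are assumptions. The `auroc` helper is the standard pairwise formulation, treating wrong programs as the positive class.

```python
import math
from typing import Sequence


def top_k_token_entropy(top_k_logprobs: Sequence[Sequence[float]]) -> float:
    """Mean entropy (nats) of the renormalized top-K distribution per token.

    top_k_logprobs[t] holds the log-probabilities of the K most likely
    candidates at generation step t (assumed input format).
    """
    if not top_k_logprobs:
        raise ValueError("need at least one token position")
    total = 0.0
    for logprobs in top_k_logprobs:
        # Renormalize over the K retained candidates; mass outside the
        # top K is ignored (an assumption of this sketch).
        m = max(logprobs)
        weights = [math.exp(lp - m) for lp in logprobs]
        z = sum(weights)
        total += -sum((w / z) * math.log(w / z) for w in weights if w > 0.0)
    return total / len(top_k_logprobs)


def auroc(uncertainty: Sequence[float], is_wrong: Sequence[bool]) -> float:
    """Chance that a wrong sample gets higher uncertainty than a correct one.

    Ties count as half a win.
    """
    pos = [u for u, w in zip(uncertainty, is_wrong) if w]
    neg = [u for u, w in zip(uncertainty, is_wrong) if not w]
    if not pos or not neg:
        raise ValueError("need both correct and wrong samples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up log-probabilities for two generations of four tokens each.
confident = [[math.log(0.97), math.log(0.02), math.log(0.01)]] * 4
hesitant = [[math.log(0.40), math.log(0.35), math.log(0.25)]] * 4
scores = [top_k_token_entropy(confident), top_k_token_entropy(hesitant)]
print(scores, auroc(scores, is_wrong=[False, True]))
```

Because the score only needs the log-probabilities already produced during a single generation pass, it requires no extra sampling, which is consistent with the abstract's description of it as a low-cost signal.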

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
