This paper addresses the critical need for reliable uncertainty estimation (UE) in code generation by large language models (LLMs), highlighting the unique challenges posed by code, such as token fragility and the intent-code gap. The authors introduce three orthogonal uncertainty axes (lexical, algorithmic, and functional) and demonstrate that their ensemble approach significantly improves uncertainty estimation performance across five code LLMs. The proposed method achieves an average AUROC of 0.776, surpassing the best natural language-derived baseline by 8.1 points while offering a cost-effective alternative to existing multi-pass methods.
Code generation requires a unique approach to uncertainty estimation, as a single wrong token can disrupt an entire program's functionality.
Large language models (LLMs) are increasingly deployed as code generators, where silently wrong programs pose real safety and reliability risks. Reliable uncertainty estimation (UE) is essential for selective prediction, human-in-the-loop review, and downstream agentic decisions. Yet most existing code UE methods are inherited from natural language (NL) generation and ignore properties that make code distinct. We argue that code differs from NL in three ways: a single wrong token can break an entire program (token fragility); algorithmic intent and concrete implementation can disagree independently (intent-code gap); and programs can be executed (executability). We instantiate these properties as three orthogonal uncertainty axes: lexical (Top-K token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency). Across five code LLMs, our three-axis ensemble improves average AUROC from 0.696 for the strongest NL-derived baseline to 0.776 (+8.1 points). Notably, on Qwen3-14B, our single-pass Top-K token entropy matches the strongest multi-pass baseline while being over 3x cheaper; across models, it remains a competitive low-cost signal. These results suggest that code UE deserves code-specific design rather than direct NL ports.
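As a rough illustration of the lexical axis and of how AUROC scores an uncertainty signal, the sketch below computes a Top-K token entropy from per-position log-probabilities and then measures how well it separates wrong programs from correct ones. The abstract does not give the exact formulation, so several choices here are assumptions rather than the authors' method: the entropy is renormalized over the K returned candidates, per-token entropies are aggregated by a plain mean, and the function names (`topk_entropy`, `sequence_uncertainty`, `auroc`) are hypothetical.

```python
import math
from typing import Sequence


def topk_entropy(logprobs: Sequence[float]) -> float:
    """Shannon entropy (in nats) over the top-K candidates at one position.

    Assumption: the K log-probabilities are renormalized to sum to 1,
    since the probability mass outside the top K is not available.
    """
    m = max(logprobs)  # subtract the max for numerical stability
    weights = [math.exp(lp - m) for lp in logprobs]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sequence_uncertainty(topk_logprobs: Sequence[Sequence[float]]) -> float:
    """Mean per-token Top-K entropy for one generated program.

    Assumption: a simple mean over positions; other aggregations
    (max, length-normalized sum) are equally plausible.
    """
    if not topk_logprobs:
        return 0.0
    entropies = [topk_entropy(step) for step in topk_logprobs]
    return sum(entropies) / len(entropies)


def auroc(scores: Sequence[float], is_wrong: Sequence[bool]) -> float:
    """Probability that a wrong program receives a higher uncertainty
    score than a correct one, with ties counted as one half."""
    wrong = [s for s, w in zip(scores, is_wrong) if w]
    right = [s for s, w in zip(scores, is_wrong) if not w]
    if not wrong or not right:
        raise ValueError("need at least one wrong and one correct sample")
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in wrong
        for b in right
    )
    return wins / (len(wrong) * len(right))
```

In use, each generated program would contribute one `sequence_uncertainty` score from a single forward pass, and `auroc` would be evaluated against pass/fail labels from unit tests. An AUROC of 0.5 corresponds to an uninformative signal and 1.0 to perfect separation, which is the scale on which the reported 0.696 and 0.776 averages sit.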