This paper introduces ConSelf, a self-improvement method for code generation that circumvents the need for external teacher models or test oracles. ConSelf uses code semantic entropy to build a curriculum of learnable problems and consensus-driven direct preference optimization (Con-DPO) to refine the model based on behavioral agreement among candidate programs. Experiments show that ConSelf significantly improves code generation performance across various benchmarks and LLMs, demonstrating effective self-improvement without external supervision.
LLMs can bootstrap their code generation abilities by focusing on problems where they show diverse solution attempts and then reinforcing solutions that exhibit behavioral consensus.
Improving the code generation capabilities of large language models (LLMs) typically relies on supervised fine-tuning or preference optimization, both of which require costly external resources such as powerful teacher models or reliable test suites. However, in real-world scenarios, reference solutions and test oracles are much harder to obtain than problem descriptions and test inputs. In this paper, we tackle a challenging yet realistic question: can a code language model improve itself without access to either a superior teacher or a test oracle? To answer this, we propose ConSelf, a self-improving approach built on two key ideas. First, we introduce code semantic entropy, a novel metric that measures problem-level uncertainty by assessing the functional diversity of program behaviors, enabling curriculum construction over the most learnable problems. Second, we present consensus-driven direct preference optimization (Con-DPO), a preference-based fine-tuning method that weights each preference pair by its behavioral consensus, thereby mitigating the impact of noisy self-generated supervision. Experiments on various benchmarks and backbone LLMs demonstrate that ConSelf significantly outperforms baselines, validating the effectiveness of semantic entropy-based curriculum construction and consensus-driven optimization in improving code generation without external supervision.
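The code semantic entropy idea can be illustrated with a minimal sketch: sample candidate programs for a problem, group them into behavioral equivalence classes by their outputs on the shared test inputs, and take the entropy of the resulting class distribution. This is an illustrative reconstruction, not the paper's implementation; `run` stands in for a hypothetical sandboxed executor, and how it evaluates programs (and handles crashes or timeouts) is an assumption here.

```python
import math
from collections import Counter

def code_semantic_entropy(candidates, test_inputs, run):
    """Sketch of problem-level uncertainty via behavioral diversity.

    `candidates` are sampled programs for one problem; `run(prog, x)`
    is a hypothetical sandboxed executor returning the program's output
    on test input x. Programs with identical output signatures fall
    into the same behavioral equivalence class; the entropy of the
    class distribution is returned (0 = full agreement, higher = more
    functional diversity, i.e. more model uncertainty)."""
    behaviors = Counter()
    for prog in candidates:
        # Behavioral signature: tuple of outputs on all test inputs.
        signature = tuple(run(prog, x) for x in test_inputs)
        behaviors[signature] += 1
    n = len(candidates)
    return -sum((c / n) * math.log(c / n) for c in behaviors.values())

# Toy usage: two candidates agree behaviorally, one differs.
progs = ["lambda x: x * 2", "lambda x: x + x", "lambda x: x ** 2"]
h = code_semantic_entropy(progs, [1, 2, 3],
                          run=lambda prog, x: eval(prog)(x))
```

A curriculum could then prioritize problems with intermediate entropy, where the model produces genuinely diverse attempts rather than unanimous (too easy) or uniformly scattered (too hard) behavior.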
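The Con-DPO idea of down-weighting noisy preference pairs can likewise be sketched per pair: apply the standard DPO logistic loss to the log-probability margin between the chosen and rejected programs, then scale it by a consensus weight in [0, 1] (e.g. the fraction of sampled programs that behaviorally agree with the chosen solution). The specific weighting form below is an assumption for illustration, not the paper's exact formula.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def con_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 consensus, beta=0.1):
    """Consensus-weighted DPO loss for a single preference pair.

    logp_w / logp_l: policy log-probabilities of the chosen (winner)
    and rejected (loser) programs; ref_logp_* are the frozen reference
    model's log-probabilities. `consensus` in [0, 1] scales the pair's
    contribution, so low-agreement (likely noisy) self-generated pairs
    influence training less. The linear scaling is an assumed choice."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -consensus * math.log(sigmoid(margin))

# A pair backed by strong behavioral consensus keeps most of its loss;
# a contested pair with the same margin is attenuated.
strong = con_dpo_loss(-1.0, -2.0, -1.5, -1.5, consensus=0.9)
weak = con_dpo_loss(-1.0, -2.0, -1.5, -1.5, consensus=0.3)
```

In a training loop the per-pair losses would be averaged over a batch and backpropagated through the policy's log-probabilities; the consensus weight itself carries no gradient.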