This paper investigates how two language models should interact for code synthesis, contrasting a conventional plan-then-code approach with a review-based approach. The authors find that having a code specialist generate code and a reasoning model review it outperforms both the plan-then-code approach and the code specialist alone, achieving 90.2% pass@1 on HumanEval+. The benefit of review scales with the richness of the problem specification (+9.8pp on richly specified problems versus +2.3pp on lean ones), which suggests composing models by their strengths and investing in specification quality.
Ditch the planner: language models generate far better code when a code specialist leads and a reasoning model reviews, outperforming GPT-4o on HumanEval+ at a fraction of the cost.
How should two language models interact to produce better code than either can alone? The conventional approach -- a reasoning model plans, a code specialist implements -- seems natural but fails: on HumanEval+, plan-then-code degrades performance by 2.4 percentage points versus the code specialist alone. We show that reversing the interaction changes everything. When the code specialist generates freely and the reasoning model reviews instead of plans, the same two models on the same hardware achieve 90.2% pass@1 -- exceeding GPT-4o (87.2%) and O1 Preview (89.0%) -- on ~$2/hr of commodity GPU. Cross-benchmark validation across 542 problems (HumanEval+ and MBPP+) reveals a moderating variable: review effectiveness scales with specification richness, yielding 4x more improvement on richly-specified problems (+9.8pp) than on lean ones (+2.3pp), while remaining net-positive in both cases. The practical implication is twofold: compose models by their cognitive strengths (reviewers review, coders code), and invest in specification quality to amplify the returns.
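The two interaction patterns contrasted above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the prompts, the number of review rounds, or how the reviewer signals approval, so the `Model` callable interface, the prompt wording, the `APPROVE` convention, and the `max_rounds` parameter are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical model interface (an assumption): prompt text in, completion text out.
Model = Callable[[str], str]


def plan_then_code(reasoner: Model, coder: Model, spec: str) -> str:
    """Conventional pattern: the reasoning model plans, the code specialist implements."""
    plan = reasoner(f"Write a step-by-step plan for this task:\n{spec}")
    return coder(f"Task:\n{spec}\n\nPlan:\n{plan}\n\nImplement the plan.")


def code_then_review(
    reasoner: Model, coder: Model, spec: str, max_rounds: int = 1
) -> str:
    """Reversed pattern: the code specialist generates freely, the reasoning model reviews.

    The APPROVE convention and the single default round are illustrative
    choices, not details taken from the paper.
    """
    code = coder(f"Task:\n{spec}\n\nWrite the solution.")
    for _ in range(max_rounds):
        review = reasoner(
            f"Specification:\n{spec}\n\nCandidate:\n{code}\n\n"
            "Reply APPROVE if the candidate satisfies the specification; "
            "otherwise list the defects."
        )
        if review.strip().upper().startswith("APPROVE"):
            break  # reviewer found nothing to fix
        # Hand the reviewer's findings back to the code specialist for a revision.
        code = coder(
            f"Task:\n{spec}\n\nPrevious attempt:\n{code}\n\n"
            f"Reviewer feedback:\n{review}\n\nRevise the solution."
        )
    return code
```

In this sketch the reviewer always sees the full specification alongside the candidate. That fits the abstract's observation that review gains grow with specification richness, since a richer specification gives the reviewer more to check the candidate against.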