The paper introduces ReLoop, a framework that improves the reliability of LLM-generated optimization code by targeting silent failures: code that executes but encodes a semantically incorrect formulation. ReLoop combines two mechanisms. Structured generation decomposes code production into a four-stage reasoning chain with explicit variable-type reasoning and self-verification. Behavioral verification tests how the formulation responds to parameter perturbation. Experiments across five models and three benchmarks show that ReLoop significantly improves correctness and execution rates, particularly on complex compositional problems. The authors also release a new benchmark dataset, RetailOpt-190.
LLMs generating optimization code can be silently wrong 77% of the time, but ReLoop's structured generation and behavioral verification can dramatically improve correctness without needing ground truth.
Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk: code that executes and returns solver-feasible solutions may encode semantically incorrect formulations, creating a feasibility-correctness gap of up to 90 percentage points on compositional problems. We introduce ReLoop, addressing silent failures from two complementary directions. Structured generation decomposes code production into a four-stage reasoning chain (understand, formalize, synthesize, verify) that mirrors expert modeling practice, with explicit variable-type reasoning and self-verification to prevent formulation errors at their source. Behavioral verification detects errors that survive generation by testing whether the formulation responds correctly to solver-based parameter perturbation, without requiring ground truth -- an external semantic signal that bypasses the self-consistency problem inherent in LLM-based code review. The two mechanisms are complementary: structured generation dominates on complex compositional problems, while behavioral verification becomes the largest single contributor on problems with localized formulation defects. Together with execution recovery via IIS-enhanced diagnostics, ReLoop raises correctness from 22.6% to 31.1% and execution from 72.1% to 100.0% on the strongest model, with consistent gains across five models spanning three paradigms (foundation, SFT, RL) and three benchmarks. We additionally release RetailOpt-190, 190 compositional retail optimization scenarios targeting the multi-constraint interactions where LLMs most frequently fail.
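The idea behind behavioral verification can be illustrated with a toy sketch. This is not ReLoop's implementation: the knapsack instance, the brute-force "solver", and the names `solve_correct`, `solve_buggy`, and `responds_to_tightening` are all hypothetical, chosen only to show how perturbing a parameter and re-solving can expose a formulation that runs and returns an answer yet ignores a constraint. The check assumes the perturbed parameter (capacity) is binding, so tightening it should strictly lower the optimum of a maximization problem.

```python
from itertools import product

# Hypothetical toy instance: three items with values and weights.
VALUES = (6, 5, 4)
WEIGHTS = (3, 2, 2)


def solve_correct(capacity):
    """Brute-force 0/1 knapsack: maximize value subject to weight <= capacity."""
    best = 0
    for x in product((0, 1), repeat=len(VALUES)):
        weight = sum(w * xi for w, xi in zip(WEIGHTS, x))
        if weight <= capacity:
            best = max(best, sum(v * xi for v, xi in zip(VALUES, x)))
    return best


def solve_buggy(capacity):
    """Same model with the capacity constraint silently dropped.

    It still executes and returns an objective value, which is what
    makes the failure silent.
    """
    return max(
        sum(v * xi for v, xi in zip(VALUES, x))
        for x in product((0, 1), repeat=len(VALUES))
    )


def responds_to_tightening(solve, base_capacity, tight_capacity):
    """Return True if tightening the capacity strictly lowers the optimum.

    Assumption: the capacity is binding at base_capacity, so a correct
    maximization model must lose objective value when it is reduced.
    No ground-truth optimum is needed, only two solves of the same model.
    """
    return solve(tight_capacity) < solve(base_capacity)


if __name__ == "__main__":
    print("correct model responds:", responds_to_tightening(solve_correct, 4, 2))
    print("buggy model responds:  ", responds_to_tightening(solve_buggy, 4, 2))
```

The buggy model returns the same objective at both capacities, so the check flags it even though no reference solution is available. A real pipeline would call an actual solver and would need to decide which parameters are expected to be binding.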