The paper investigates the effectiveness of prompt optimization in compound AI systems, finding it statistically no better than random chance in many cases. Through extensive experiments on Claude Haiku and Amazon Nova Lite, the authors show that prompt optimization primarily benefits tasks with exploitable output structure, i.e. a format the model can produce but does not adopt by default. They further demonstrate that agent prompts do not significantly interact, and they provide a diagnostic tool to predict whether prompt optimization will be worthwhile.
End-to-end prompt optimization is often a waste of time and money, succeeding only when it coaxes models into specific output formats they are already capable of producing.
Prompt optimization in compound AI systems is statistically indistinguishable from a coin flip: across 72 optimization runs on Claude Haiku (6 methods $\times$ 4 tasks $\times$ 3 repeats), 49% score below zero-shot; on Amazon Nova Lite, the failure rate is even higher. Yet on one task, all six methods improve over zero-shot by up to $+6.8$ points. What distinguishes success from failure? We investigate with 18,000 grid evaluations and 144 optimization runs, testing two assumptions behind end-to-end optimization tools like TextGrad and DSPy: (A) individual prompts are worth optimizing, and (B) agent prompts interact, requiring joint optimization. Interaction effects are never significant ($p>0.52$, all $F<1.0$), and optimization helps only when the task has exploitable output structure -- a format the model can produce but does not default to. We provide a two-stage diagnostic: an \$80 ANOVA pre-test for agent coupling, and a 10-minute headroom test that predicts whether optimization is worthwhile -- turning a coin flip into an informed decision.
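The abstract's agent-coupling pre-test is described only as an ANOVA over agent prompts. A minimal sketch of one plausible implementation follows, assuming a two-agent pipeline scored over a grid of prompt pairs; the agent names (`planner`, `executor`), the `run_pipeline` scorer, and the seed count are hypothetical placeholders, not details from the paper.

```python
# Two-way ANOVA pre-test for agent coupling: does the (planner, executor)
# interaction term explain any score variance beyond the main effects?
import itertools
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

def coupling_pretest(planner_prompts, executor_prompts, run_pipeline, n_seeds=3):
    """Score every prompt pair, then test the interaction term's significance."""
    rows = []
    for p, e in itertools.product(planner_prompts, executor_prompts):
        for seed in range(n_seeds):
            rows.append({"planner": p, "executor": e,
                         "score": run_pipeline(p, e, seed=seed)})
    df = pd.DataFrame(rows)

    # Fit score ~ planner + executor + planner:executor and extract the
    # F statistic and p-value of the interaction row.
    model = ols("score ~ C(planner) * C(executor)", data=df).fit()
    table = sm.stats.anova_lm(model, typ=2)
    interaction = table.loc["C(planner):C(executor)"]
    return interaction["F"], interaction["PR(>F)"]
```

If the interaction p-value is large (the paper reports $p>0.52$ with all $F<1.0$), joint end-to-end optimization is unlikely to beat tuning each agent's prompt independently.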