The paper introduces Duel-Evolve, an evolutionary optimization algorithm that iteratively refines LLM outputs at test time using pairwise preferences elicited from the same LLM that generates the candidates, eliminating the need for external scalar rewards. Duel-Evolve aggregates the noisy pairwise comparisons with a Bayesian Bradley-Terry model to estimate candidate quality, and uses Double Thompson Sampling both to allocate the comparison budget and to select parents for generating improved candidates. On MathBench it achieves 20 percentage points higher accuracy than existing methods and baselines, and on LiveCodeBench it improves over comparable iterative methods by more than 12 percentage points, indicating that self-preferences are a strong optimization signal.
LLMs can significantly improve their performance on complex tasks like math and coding *without any external rewards*, simply by iteratively comparing and refining their own outputs.
Many applications seek to optimize LLM outputs at test time by iteratively proposing, scoring, and refining candidates over a discrete output space. Existing methods use a calibrated scalar evaluator for the target objective to guide search, but for many tasks such scores are unavailable, too sparse, or unreliable. Pairwise comparisons, by contrast, are often easier to elicit, still provide useful signal on improvement directions, and can be obtained from the LLM itself without external supervision. Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates. Duel-Evolve aggregates these noisy candidate comparisons via a Bayesian Bradley-Terry model, yielding uncertainty-aware estimates of candidate quality. These quality estimates guide allocation of the comparison budget toward plausible optima using Double Thompson Sampling, as well as selection of high-quality parents to generate improved candidates. We evaluate Duel-Evolve on MathBench, where it achieves 20 percentage points higher accuracy than existing methods and baselines, and on LiveCodeBench, where it improves over comparable iterative methods by over 12 percentage points. Notably, the method requires no reward model, no ground-truth labels during search, and no hand-crafted scoring function. Results show that pairwise self-preferences provide a strong optimization signal for test-time improvement over large, discrete output spaces.
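To make the two ingredients named in the abstract more concrete, here is a minimal Python sketch of (a) fitting Bradley-Terry strengths from a matrix of pairwise wins and (b) choosing the next pair to compare with a Thompson-style rule. It is an illustration under stated assumptions, not the authors' implementation: the function names `fit_bradley_terry` and `select_duel` are hypothetical, the fit is a point estimate with pseudo-count smoothing rather than a full Bayesian posterior, and the pair selection is a simplified stand-in for Double Thompson Sampling that samples each pairwise win probability from an independent Beta posterior.

```python
import random


def fit_bradley_terry(wins, iters=200, prior=0.5):
    """Estimate Bradley-Terry strengths with the classic MM update.

    wins[i][j] is the number of times candidate i beat candidate j.
    `prior` adds pseudo-wins in both directions of every pair (an
    assumption of this sketch) so the estimate stays finite when a
    candidate has never lost or never won.
    Returns strengths that sum to 1; under the model,
    P(i beats j) = p[i] / (p[i] + p[j]).
    """
    n = len(wins)
    w = [[wins[i][j] + (prior if i != j else 0.0) for j in range(n)]
         for i in range(n)]
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(w[i][j] for j in range(n) if j != i)
            denom = sum((w[i][j] + w[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom)
        s = sum(new_p)
        p = [x / s for x in new_p]
    return p


def select_duel(wins, rng):
    """Pick the next pair to compare (simplified Thompson-style rule).

    Each pairwise win probability gets an independent Beta posterior.
    The first candidate maximizes a sampled Copeland score (number of
    opponents it beats in the sample); the second is the opponent most
    likely, under a fresh sample, to beat the first.
    """
    n = len(wins)

    def sample_prefs():
        theta = [[0.5] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                theta[i][j] = rng.betavariate(wins[i][j] + 1, wins[j][i] + 1)
                theta[j][i] = 1.0 - theta[i][j]
        return theta

    theta = sample_prefs()
    copeland = [sum(theta[i][j] > 0.5 for j in range(n) if j != i)
                for i in range(n)]
    first = max(range(n), key=lambda i: copeland[i])
    theta2 = sample_prefs()
    second = max((j for j in range(n) if j != first),
                 key=lambda j: theta2[j][first])
    return first, second
```

In a loop of this kind, one would call `select_duel(wins, random.Random(0))` to choose two candidates, ask the LLM which of the two it prefers, increment the corresponding entry of `wins`, and periodically call `fit_bradley_terry(wins)` to rank candidates when choosing parents for the next generation. How the paper couples these steps, and how it samples from the Bradley-Terry posterior, is not specified in the abstract.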