CUHK, Taobao, Tencent AI, XJTU | Apr 13, 2026 | arXiv:2604.11328

Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

Xiaoyu Ma, Yiwen Li, Haoyue Liu, Zhichao Wang, Ye Chen, Yongxin Guo, Xiaoying Tang

AI Summary

The paper introduces Prompt-Aware Online Evaluation Scheduling (POES), a method for automatic prompt optimization that adaptively chooses the training examples that best discriminate among candidate prompts. POES formulates prompt optimization as an online adaptive testing problem and integrates IRT-based discrimination, facility-location coverage, and switching-cost-aware warm-start swaps into a monotone submodular objective, which yields a (1-1/e) greedy guarantee for cold starts and bounded drift for warm-start updates. Experiments across 36 tasks show that POES achieves the highest overall average accuracy with approximately 4 percent token overhead at the same evaluation budget, indicating that principled evaluation scheduling outperforms naive evaluation strategies.
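To make the IRT-based discrimination idea concrete, here is a minimal sketch that assumes a two-parameter logistic (2PL) IRT model and uses the standard Fisher information of an item as its discrimination score. The summary does not specify the paper's exact utility, so the function names, the 2PL choice, and the numbers below are illustrative assumptions rather than the authors' formulation.

```python
import math


def p_correct(theta, a, b):
    """2PL probability that a prompt with ability theta solves an item
    with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def discrimination_utility(thetas, a, b):
    """Assumed utility: total information the item carries about the
    abilities of the current top candidate prompts."""
    return sum(item_information(t, a, b) for t in thetas)


# Hypothetical ability estimates for the three strongest candidate prompts.
top_prompts = [0.8, 1.0, 1.2]

# An item far too easy for these prompts versus one matched to their ability.
easy_item = discrimination_utility(top_prompts, a=1.5, b=-3.0)
matched_item = discrimination_utility(top_prompts, a=1.5, b=1.0)
```

Under this assumed model, the matched item scores much higher than the easy one, because nearly every strong prompt answers the easy item correctly and so it tells the scheduler little about which prompt is best.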

Key Contribution

Principled selection of k = 20 evaluation examples matches or exceeds naive evaluation at k = 30-50, cutting token consumption by 35-60 percent without sacrificing accuracy.

Abstract

Automatic prompt optimization (APO) hinges on the quality of its evaluation signal, yet scoring every prompt candidate on the full training set is prohibitively expensive. Existing methods either fix a single evaluation subset before optimization begins (principled but prompt-agnostic) or adapt it heuristically during optimization (flexible but unstable and lacking formal guarantees). We observe that APO naturally maps to an online adaptive testing problem: prompts are examinees, training examples are test items, and the scheduler should select items that best discriminate among the strongest candidates. This insight motivates Prompt-Aware Online Evaluation Scheduling (POES), which integrates an IRT-based discrimination utility, a facility-location coverage term, and switching-cost-aware warm-start swaps into a unified objective that is provably monotone submodular, yielding a (1-1/e) greedy guarantee for cold starts and bounded drift for warm-start updates. An adaptive controller modulates the exploration-exploitation balance based on optimization progress. Across 36 tasks spanning three benchmark families, POES achieves the highest overall average accuracy (6.2 percent improvement over the best baseline) with negligible token overhead (approximately 4 percent) at the same evaluation budget. Moreover, principled selection at k = 20 examples matches or exceeds the performance of naive evaluation at k = 30-50, reducing token consumption by 35-60 percent, showing that selecting smarter is more effective than selecting more. Our results demonstrate that evaluation scheduling is a first-class component of APO, not an implementation detail.
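The cold-start selection described in the abstract can be illustrated with a small greedy routine. The sketch below assumes an objective made of a non-negative per-example utility (a stand-in for the discrimination term) plus a facility-location coverage term over a non-negative similarity matrix; such a sum is monotone submodular, so greedy selection under a cardinality budget k carries the classical (1-1/e) approximation guarantee. The names `objective`, `greedy_select`, the weight `lam`, and the toy data are hypothetical, and the switching-cost-aware warm-start swaps and adaptive controller are not modelled here.

```python
def objective(selected, utility, sim, lam=1.0):
    """Assumed objective: sum of per-example utilities plus
    lam * facility-location coverage of all examples by the selected set."""
    if not selected:
        return 0.0
    modular = sum(utility[i] for i in selected)
    # Each example j is "covered" by its most similar selected example.
    coverage = sum(max(row[i] for i in selected) for row in sim)
    return modular + lam * coverage


def greedy_select(utility, sim, k, lam=1.0):
    """Greedily add the example with the largest marginal gain, k times."""
    selected = []
    remaining = set(range(len(utility)))
    for _ in range(min(k, len(utility))):
        base = objective(selected, utility, sim, lam)
        best = max(
            sorted(remaining),  # sorted for deterministic tie-breaking
            key=lambda i: objective(selected + [i], utility, sim, lam) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: examples 0 and 1 are near-duplicates, as are examples 2 and 3.
utility = [0.9, 0.8, 0.3, 0.2]
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
chosen = greedy_select(utility, sim, k=2)
```

In this toy case, ranking by utility alone would pick the two near-duplicate examples 0 and 1, while the coverage term steers the second pick to example 2 so that both clusters are represented.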

Eval Frameworks & Benchmarks | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
