Emily | Feb 25, 2026 | arXiv:2602.22480

VeRO: An Evaluation Harness for Agents to Optimize Agents

Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan Xue, Sam Denton

AI Summary

The paper introduces VeRO, a novel evaluation harness designed to systematically assess the performance of coding agents in agent optimization tasks, which involve iterative improvement through edit-execute-evaluate cycles. VeRO provides versioned agent snapshots, budget-controlled evaluation, and structured execution traces to address the challenges of evaluating agents that interleave deterministic code with stochastic LLM completions. An empirical study using VeRO compares optimizer configurations and identifies modifications that reliably improve target agent performance across a benchmark suite of target agents and tasks.

Key Contribution

VeRO is an evaluation harness and benchmark for studying how coding agents iteratively improve *other* agents, and for identifying which optimization strategies reliably work.

Abstract

An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles. Despite its relevance, the community lacks a systematic understanding of coding agent performance on this task. Agent optimization differs fundamentally from conventional software engineering: the target agent interleaves deterministic code with stochastic LLM completions, requiring structured capture of both intermediate reasoning and downstream execution outcomes. To address these challenges, we introduce VeRO (Versioning, Rewards, and Observations), which provides (1) a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces, and (2) a benchmark suite of target agents and tasks with reference evaluation procedures. Using VeRO, we conduct an empirical study comparing optimizer configurations across tasks and analyzing which modifications reliably improve target agent performance. We release VeRO to support research on agent optimization as a core capability for coding agents.
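The edit-execute-evaluate cycle described in the abstract can be illustrated with a toy loop. The sketch below is not VeRO's API: every name in it (`Harness`, `Snapshot`, `Trace`, `optimize`, `toy_agent`, `propose_edit`) is hypothetical, the "agent" is a deterministic stand-in for a real LLM-backed agent, and the reward is a simple exact-match check. It only shows how versioned snapshots, a fixed evaluation budget, and per-task traces might fit together.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """One immutable version of the target agent (here, just a config dict)."""
    version: int
    config: dict
    score: Optional[float] = None


@dataclass
class Trace:
    """Structured record of one task execution under one agent version."""
    version: int
    task: str
    output: str
    reward: float


class BudgetExceeded(RuntimeError):
    """Raised when an evaluation is requested after the budget is spent."""


@dataclass
class Harness:
    """Hypothetical harness: versions every candidate, caps evaluations, logs traces."""
    tasks: List[Tuple[str, str]]  # (task input, expected output)
    budget: int                   # maximum number of full evaluations
    snapshots: List[Snapshot] = field(default_factory=list)
    traces: List[Trace] = field(default_factory=list)
    used: int = 0

    def evaluate(self, run_agent: Callable[[dict, str], str], config: dict) -> Snapshot:
        if self.used >= self.budget:
            raise BudgetExceeded("evaluation budget exhausted")
        self.used += 1
        version = len(self.snapshots)
        rewards = []
        for task, expected in self.tasks:
            output = run_agent(config, task)
            reward = 1.0 if output == expected else 0.0  # assumed exact-match reward
            self.traces.append(Trace(version, task, output, reward))
            rewards.append(reward)
        snap = Snapshot(version, dict(config), sum(rewards) / len(rewards))
        self.snapshots.append(snap)
        return snap


def optimize(harness, run_agent, config, propose_edit) -> Snapshot:
    """Edit-execute-evaluate loop: keep the best snapshot until the budget runs out."""
    best = harness.evaluate(run_agent, config)
    while harness.used < harness.budget:
        candidate = propose_edit(best.config, harness.traces)  # edit
        snap = harness.evaluate(run_agent, candidate)          # execute + evaluate
        if snap.score > best.score:
            best = snap
    return best


def toy_agent(config: dict, task: str) -> str:
    """Deterministic stand-in for a target agent."""
    return task.upper() if config.get("uppercase") else task


def propose_edit(config: dict, traces: List[Trace]) -> dict:
    """Stand-in for the optimizer: a real one would inspect the traces."""
    return {**config, "uppercase": True}


harness = Harness(tasks=[("a", "A"), ("b", "B")], budget=3)
best = optimize(harness, toy_agent, {"uppercase": False}, propose_edit)
print(best.version, best.score, harness.used)
```

In this toy setup the optimizer sees the full trace list on every step and the harness, not the optimizer, enforces the budget, so a candidate edit cannot be scored without being recorded as a new version.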

Code Generation & Program Synthesis | Eval Frameworks & Benchmarks | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
