This study investigates whether an LLM agent's base task-solving capability predicts its effectiveness in harness self-evolution, the practice of updating external harnesses from execution evidence. The ability to produce useful harness updates (harness-updating) is roughly constant across capability tiers, whereas the benefit derived from those updates (harness-benefit) is non-monotonic: mid-tier models improve the most. The findings suggest that capability budget is better spent on the task-solving agent than on the evolver.
Mid-tier LLMs benefit more from harness self-evolution than their stronger counterparts, challenging the assumption that stronger models adapt better.
LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories and tools, that shape task execution without changing model parameters. Harness self-evolution adapts such agents by updating these harnesses from execution evidence. Yet it remains unclear whether a model's base capability in task-solving predicts its capabilities in harness self-evolution: which models produce useful harness updates, and which actually benefit from them? We analyze two harness self-evolution capabilities: (i) harness-updating, the capability to produce useful persistent harness updates from execution evidence; (ii) harness-benefit, the capability to benefit from updated harnesses during task solving. Our analysis reveals two findings. First, harness-updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains; even Qwen3.5-9B's updates yield gains comparable to those of Claude Opus 4.6. Second, harness-benefit is non-monotonic in base capability: weak-tier models benefit little from updated harnesses, mid-tier models benefit most, and strong-tier models benefit less than mid-tier. We trace low gains at the weak tier to two failure modes: weak-tier models may fail to activate relevant harness artifacts, or activate them but fail to follow them faithfully. These findings suggest investing capability budget in the task-solving agent rather than the evolver, and targeting harness invocation and long-horizon instruction following in agent training. Our source code is publicly available at https://github.com/A-EVO-Lab/a-evolve/tree/release/harness-evolution.
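The two capabilities defined in the abstract can be illustrated with a minimal sketch. It assumes, without confirmation from the paper, that a "gain" is the absolute change in task success rate when a solver switches from the base harness to one updated by an evolver, that harness-updating is summarized per evolver by averaging over solvers, and that harness-benefit is summarized per solver by averaging over evolvers. The function names, model labels, and every number below are hypothetical placeholders, not results or code from the paper or its repository.

```python
from statistics import mean

# Hypothetical success rates with the base (un-evolved) harness, keyed by solver.
base = {"solver_weak": 0.20, "solver_mid": 0.50, "solver_strong": 0.80}

# Hypothetical success rates with a harness updated by a given evolver,
# keyed by (evolver, solver). Placeholder values only.
evolved = {
    ("evolver_small", "solver_weak"): 0.22,
    ("evolver_small", "solver_mid"): 0.62,
    ("evolver_small", "solver_strong"): 0.85,
    ("evolver_large", "solver_weak"): 0.22,
    ("evolver_large", "solver_mid"): 0.64,
    ("evolver_large", "solver_strong"): 0.85,
}


def gain(base_score: float, evolved_score: float) -> float:
    """Absolute improvement from replacing the base harness with an updated one."""
    return evolved_score - base_score


def harness_updating(base, evolved, evolver: str) -> float:
    """Assumed summary: mean gain an evolver's updates produce across all solvers."""
    return mean(
        gain(base[solver], score)
        for (ev, solver), score in evolved.items()
        if ev == evolver
    )


def harness_benefit(base, evolved, solver: str) -> float:
    """Assumed summary: mean gain a solver obtains across all evolvers' updates."""
    return mean(
        gain(base[solver], score)
        for (_, sv), score in evolved.items()
        if sv == solver
    )


if __name__ == "__main__":
    for evolver in ("evolver_small", "evolver_large"):
        print(evolver, round(harness_updating(base, evolved, evolver), 3))
    for solver in base:
        print(solver, round(harness_benefit(base, evolved, solver), 3))
```

Separating the two averages keeps the evolver's contribution apart from the solver's: under this framing, a "flat" harness-updating result would show as similar per-evolver values, and a non-monotonic harness-benefit result would show as a peak at the middle solver.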