Mar 3, 2026

arXiv:2603.02766

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, Tu Vu

AI Summary

The paper introduces EvoSkill, a framework that automatically discovers and refines agent skills through iterative failure analysis. It targets two limitations of current practice for coding agents: skills that are hand-crafted, and evolutionary methods that optimize low-level artifacts such as prompts and code. EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders. A Pareto frontier of agent programs governs selection, retaining only skills that improve held-out validation performance while the underlying model stays frozen. Exact-match accuracy on OfficeQA rises by 7.3 percentage points (60.6% to 67.9%) and accuracy on SealQA rises by 12.1 percentage points (26.6% to 38.7%). Skills evolved on SealQA also transfer zero-shot to BrowseComp, improving accuracy by 5.3% without modification, which suggests that skill-level optimization produces capabilities that carry over beyond the training task.

Key Contribution

Forget hand-crafted skills: EvoSkill's automated discovery process boosts agent accuracy by up to 12.1 percentage points (SealQA, 26.6% to 38.7%), and skills evolved on SealQA transfer zero-shot to BrowseComp.

Abstract

Coding agents are increasingly used as general-purpose problem solvers, but their flexibility does not by itself confer the domain expertise needed for specialized tasks. Recent work addresses this through agent skills: reusable workflows and code that augment agents with domain-specific capabilities. Most skills today are hand-crafted, and existing evolutionary approaches optimize low-level artifacts (e.g., prompts and code) that are tightly coupled to specific models and tasks. We introduce EvoSkill, a self-evolving framework that automatically discovers and refines agent skills through iterative failure analysis. EvoSkill analyzes execution failures, proposes new skills or edits to existing ones, and materializes them into structured, reusable skill folders. A Pareto frontier of agent programs governs selection, retaining only skills that improve held-out validation performance while the underlying model remains frozen. We evaluate EvoSkill on two benchmarks: OfficeQA, a grounded reasoning benchmark over U.S. Treasury data, where it improves exact-match accuracy by 7.3% (60.6% → 67.9%); and SealQA, a search-augmented QA benchmark with noisy retrieval, where it yields a 12.1% gain (26.6% → 38.7%). We also investigate whether skills evolved on one task transfer zero-shot to another: skills evolved on SealQA transfer zero-shot to BrowseComp, improving accuracy by 5.3% without modification, demonstrating that skill-level optimization produces transferable capabilities beyond the training task.
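The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which objectives the Pareto frontier uses, so the sketch assumes each agent program is scored by a vector of validation scores (one per task group). The names `AgentProgram`, `dominates`, `pareto_frontier`, `evolve`, `toy_propose`, and `toy_evaluate` are all hypothetical, and the proposal and evaluation steps are deterministic stand-ins for failure analysis and held-out validation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

Scores = Tuple[float, ...]


@dataclass(frozen=True)
class AgentProgram:
    """Hypothetical container: a frozen model plus a tuple of skill names."""
    skills: Tuple[str, ...] = ()


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(scored: Dict[AgentProgram, Scores]) -> Dict[AgentProgram, Scores]:
    """Keep only the programs whose score vector no other program dominates."""
    return {
        program: scores
        for program, scores in scored.items()
        if not any(dominates(other, scores) for other in scored.values())
    }


def evolve(
    seed: AgentProgram,
    propose: Callable[[AgentProgram, Scores], AgentProgram],
    evaluate: Callable[[AgentProgram], Scores],
    rounds: int,
) -> Dict[AgentProgram, Scores]:
    """Propose a child for each frontier program, then re-select the frontier."""
    frontier = {seed: evaluate(seed)}
    for _ in range(rounds):
        candidates = dict(frontier)
        for program, scores in frontier.items():
            child = propose(program, scores)
            if child not in candidates:  # skip programs already scored
                candidates[child] = evaluate(child)
        frontier = pareto_frontier(candidates)
    return frontier


# Toy stand-ins (assumptions): a fixed skill pool and a synthetic validation score.
SKILL_POOL = ("table_lookup", "unit_check")


def toy_propose(program: AgentProgram, scores: Scores) -> AgentProgram:
    """Stand-in for failure analysis: add the first skill not yet present."""
    missing = [s for s in SKILL_POOL if s not in program.skills]
    if not missing:
        return program
    return AgentProgram(skills=program.skills + (missing[0],))


def toy_evaluate(program: AgentProgram) -> Scores:
    """Stand-in for held-out validation: one score per skill-dependent task group."""
    return tuple(1.0 if s in program.skills else 0.0 for s in SKILL_POOL)


frontier = evolve(AgentProgram(), toy_propose, toy_evaluate, rounds=3)
for program, scores in frontier.items():
    print(program.skills, scores)
```

In this toy run, each proposed skill improves one validation score without hurting the other, so every child dominates its parent and the frontier collapses to the single program holding both skills. In a realistic setting, candidates that trade off one score against another would coexist on the frontier.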

Code Generation & Program Synthesis, Robotics & Embodied AI, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
