University of Science and Technology | Apr 19, 2026 | arXiv:2604.17308

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Ziao Zhang, Kou Shi, Shiting Huang, Avery Nie, Yu Zeng, Yiming Zhao, Zhen Fang, Qishen Su, Qisheng Su, Haibo Qiu, Wei Yang, Qingnan Ren, Shun Zou, Shu Zou, Wenxuan Huang, Lin Chen, Zehui Chen, Feng Zhao

AI Summary

The paper introduces SkillFlow, a benchmark of 166 tasks that evaluates autonomous agents' ability to discover, repair, and maintain a library of skills over a lifetime of experience. Tasks within each family follow a Domain-Agnostic Execution Flow (DAEF), which gives them a consistent workflow and enables skill sharing. Experiments with Claude Opus 4.6 show that lifelong skill evolution improves task success by 8.43 percentage points (from 62.65% to 71.08%), but also reveal that high skill usage does not always translate to high utility, highlighting challenges in skill discovery and transfer.
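The reported gain is an absolute difference between two success rates, which is not the same as a relative improvement. A minimal check of the arithmetic, using the two rates given in the abstract; the conversion to solved-task counts is an assumption (that each rate is a whole number of tasks out of 166), not a figure stated in the source:

```python
# Success rates for Claude Opus 4.6 as reported in the abstract (percent).
baseline = 62.65  # without lifelong skill evolution
evolved = 71.08   # with lifelong skill evolution

# Absolute gain, in percentage points.
point_gain = evolved - baseline

# Relative gain, as a percentage of the baseline rate.
relative_gain = point_gain / baseline * 100

print(f"absolute gain: {point_gain:.2f} percentage points")
print(f"relative gain: {relative_gain:.2f}%")

# Assumption: rates are whole-task counts over the 166 benchmark tasks.
TOTAL_TASKS = 166
solved_before = round(baseline / 100 * TOTAL_TASKS)
solved_after = round(evolved / 100 * TOTAL_TASKS)
print(f"implied solved tasks: {solved_before} -> {solved_after}")
```

Under that assumption, the rates are consistent with 104 and 118 solved tasks out of 166.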

Key Contribution

Even the best LLMs struggle to discover, refine, and reuse skills over a lifetime of experience, suggesting that benchmarks which only test the use of provided skills may overestimate real-world agentic capabilities.

Abstract

As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time. We introduce SkillFlow, a benchmark of 166 tasks across 20 families in which task construction within each family follows a Domain-Agnostic Execution Flow (DAEF) that defines an agent workflow framework, allowing these tasks to share a consistent workflow. Agents are evaluated under an Agentic Lifelong Learning protocol in which they begin without skills, solve tasks sequentially within each family, externalize lessons through trajectory- and rubric-driven skill patches, and carry the updated library forward. Experiments reveal a substantial capability gap. For Claude Opus 4.6, lifelong skill evolution improves task success from 62.65% to 71.08% (+8.43 points). However, high skill usage does not necessarily imply high utility: Kimi K2.5 gains only +0.60 points despite 66.87% skill usage, while Qwen-Coder-Next reaches only a 44.58% task completion rate and still regresses relative to the vanilla setting. SkillFlow contributes a structured testbed for this direction and an in-depth empirical analysis of skill discovery, patching, transfer, and their failure modes under lifelong evaluation.
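The Agentic Lifelong Learning protocol described in the abstract can be pictured as a loop over one task family. The following is a minimal sketch, not the authors' implementation: `SkillLibrary`, `run_family`, `toy_solve`, and `toy_patch` are hypothetical names, the patch format (skill name mapped to text, or `None` to delete) is invented for illustration, and starting each family with an empty library is an assumption about how "begin without skills" is applied.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class SkillLibrary:
    """Hypothetical skill store: skill name -> skill text."""
    skills: Dict[str, str] = field(default_factory=dict)

    def apply_patch(self, patch: Dict[str, Optional[str]]) -> None:
        # A patch adds or rewrites skills; a None body deletes one.
        for name, body in patch.items():
            if body is None:
                self.skills.pop(name, None)
            else:
                self.skills[name] = body


Trajectory = List[str]
Solver = Callable[[str, SkillLibrary], Tuple[Trajectory, bool]]
Patcher = Callable[[str, Trajectory, bool, SkillLibrary], Dict[str, Optional[str]]]


def run_family(tasks: List[str], solve: Solver, propose_patch: Patcher) -> float:
    """Solve one family's tasks in order, carrying the library forward."""
    library = SkillLibrary()  # assumption: each family starts with no skills
    successes = 0
    for task in tasks:
        trajectory, passed = solve(task, library)
        successes += int(passed)
        # Externalize lessons from this attempt before the next task.
        library.apply_patch(propose_patch(task, trajectory, passed, library))
    return successes / len(tasks)


# Toy stand-ins for the agent: it succeeds only when a skill for the
# task's prefix already exists, and always writes one afterwards.
def toy_solve(task: str, library: SkillLibrary) -> Tuple[Trajectory, bool]:
    prefix = task.split("-")[0]
    return [f"attempted {task}"], prefix in library.skills


def toy_patch(task, trajectory, passed, library) -> Dict[str, Optional[str]]:
    prefix = task.split("-")[0]
    return {prefix: f"lesson learned from {task}"}


rate = run_family(["parse-1", "parse-2", "parse-3"], toy_solve, toy_patch)
print(f"family success rate: {rate:.2f}")
```

In the toy run the first task fails because the library is empty, and the two later tasks succeed by reusing the skill written after the first attempt. In the benchmark itself, patches are described as trajectory- and rubric-driven, which the stub `toy_patch` does not model.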

Eval Frameworks & Benchmarks | Robotics & Embodied AI | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 44

Year: 2026

Venue: N/A
