RPI · Jun 10, 2026 · arXiv:2606.12674

Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents

Kushal Raj Bhandari, Ling Yue, Ching-Yun Ko, Dhaval Patel, Shaowu Pan, Pin-Yu Chen, Jianxi Gao

AI Summary

This paper introduces Evoflux, an inference-time evolutionary search method that helps compact language models produce executable tool workflows. By evolving typed workflow graphs through structured edits and execution feedback, Evoflux raises execution feasibility from roughly 3% to 17-24% on held-out MCP-Bench tasks. SFT and SFT+DPO trained on the same search-mined data match, underperform, or collapse below zero-shot performance, while ReAct reaches higher peaks at the cost of higher variance and more tokens. The authors conclude that execution-grounded search is more reliable when teacher traces are scarce.

Key Contribution

Evoflux treats compact-agent tool use as the repair of executable tool workflows, raising execution feasibility from roughly 3% to 17-24% on held-out MCP-Bench tasks spanning live MCP servers and 250 tools.

Abstract

Compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated function calling: an agent must discover tools from live catalogs, satisfy schemas, preserve dependencies across intermediate outputs, and ground final responses in executed evidence. Small planners often generate plausible workflow graphs that fail under tool resolution, parameter validation, dependency tracking, or execution. We argue that this failure mode is poorly handled by small-corpus distillation. A few hundred teacher traces can teach workflow format, but rarely cover the recovery behavior needed to repair failed plans over changing tool catalogs. We introduce Evoflux, an inference-time evolutionary search method that treats compact tool use as the repair of executable tool workflows. It evolves typed workflow graphs through structured edits, execution feedback, adaptive intensity, meta-guided redesign, and diversity pruning. On held-out MCP-Bench tasks spanning live MCP servers and 250 tools, Evoflux raises execution feasibility from roughly 3% to 17-24% across small planners. In contrast, SFT and SFT+DPO on the same search-mined data match, underperform, or collapse below zero-shot performance; ReAct reaches higher peaks, but with higher variance and token cost. These results show that execution-grounded search is more reliable under scarce teacher-trace budgets.
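To make the idea of repairing a workflow through structured edits and feedback concrete, here is a minimal toy sketch. It is not the authors' implementation: the catalog, the `Step` type, the `check`, `repair`, and `evolve` functions, and the three error kinds are all hypothetical. It also swaps real tool execution for a static validator, and it leaves out adaptive intensity and meta-guided redesign entirely. It only shows the general shape of a loop that mutates a plan in response to resolution, parameter, and dependency errors, and prunes duplicate candidates.

```python
from dataclasses import dataclass, replace
import difflib

# Hypothetical tool catalog (an assumption): tool name -> required parameters.
CATALOG = {
    "search_papers": {"query"},
    "fetch_abstract": {"paper_id"},
    "summarize": {"text"},
}

@dataclass
class Step:
    id: str
    tool: str
    params: dict  # parameter name -> literal value or "$<step_id>" reference

def check(workflow):
    """Static stand-in for execution feedback: list of (index, kind, detail)."""
    errors, seen = [], set()
    for i, step in enumerate(workflow):
        if step.tool not in CATALOG:
            errors.append((i, "unresolved_tool", step.tool))
        else:
            for name in sorted(CATALOG[step.tool] - step.params.keys()):
                errors.append((i, "missing_param", name))
        for value in step.params.values():
            is_ref = isinstance(value, str) and value.startswith("$")
            if is_ref and value[1:] not in seen:
                errors.append((i, "bad_dependency", value))
        seen.add(step.id)
    return errors

def repair(workflow, error):
    """Apply one structured edit aimed at a single error; returns a new list."""
    i, kind, detail = error
    step = workflow[i]
    # Wire to the previous step's output, or to a placeholder for the first step.
    source = f"${workflow[i - 1].id}" if i > 0 else "<user_input>"
    if kind == "unresolved_tool":
        match = difflib.get_close_matches(detail, list(CATALOG), n=1, cutoff=0.0)
        new = replace(step, tool=match[0])
    elif kind == "missing_param":
        new = replace(step, params={**step.params, detail: source})
    else:  # bad_dependency: rewire the dangling reference
        params = {k: (source if v == detail else v) for k, v in step.params.items()}
        new = replace(step, params=params)
    return workflow[:i] + [new] + workflow[i + 1:]

def evolve(seed, generations=5, population_size=4):
    """Keep the candidates with the fewest errors; stop when one is clean."""
    population = [seed]
    for _ in range(generations):
        scored = sorted(population, key=lambda w: len(check(w)))
        if not check(scored[0]):
            return scored[0]
        children = [repair(p, err) for p in scored for err in check(p)]
        unique = []  # crude diversity pruning: drop exact duplicates
        for candidate in scored + children:
            if candidate not in unique:
                unique.append(candidate)
        population = sorted(unique, key=lambda w: len(check(w)))[:population_size]
    return min(population, key=lambda w: len(check(w)))

seed = [
    Step("s1", "search_paper", {"query": "tool agents"}),  # misspelled tool name
    Step("s2", "fetch_abstract", {}),                      # missing paper_id
    Step("s3", "summarize", {"text": "$s9"}),              # dangling reference
]
fixed = evolve(seed)
for step in fixed:
    print(step)
```

The seed plan looks plausible but fails all three checks, which mirrors the failure mode the abstract describes for small planners. A search grounded in real execution would replace `check` with calls against live MCP servers and would need a richer fitness signal than an error count.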

Code Generation & Program Synthesis · Tool Use & Agents · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 24

Year: 2026

Venue: N/A
