The paper introduces Self-Healing Router, a fault-tolerant architecture for tool-using LLM agents that routes over a cost-weighted tool graph, using Dijkstra's algorithm for deterministic shortest-path routing. Parallel health monitors assign priority scores to runtime conditions such as tool outages; when a tool fails, its edges are reweighted to infinity and the path is recomputed, so recovery is automatic and does not invoke the LLM. Across 19 scenarios, Self-Healing Router matches ReAct's correctness while reducing control-plane LLM calls by 93% (9 vs 123 in aggregate) and eliminating the silent failures observed in a static workflow baseline under compound failures.
LLM agents can slash control-plane calls by 93% without sacrificing correctness by using a graph-based "self-healing" router that deterministically recovers from tool failures.
Tool-using LLM agents face a reliability-cost tradeoff: routing every decision through the LLM improves correctness but incurs high latency and inference cost, while pre-coded workflow graphs reduce cost but become brittle under unanticipated compound tool failures. We present Self-Healing Router, a fault-tolerant orchestration architecture that treats most agent control-flow decisions as routing rather than reasoning. The system combines (i) parallel health monitors that assign priority scores to runtime conditions such as tool outages and risk signals, and (ii) a cost-weighted tool graph where Dijkstra's algorithm performs deterministic shortest-path routing. When a tool fails mid-execution, its edges are reweighted to infinity and the path is recomputed -- yielding automatic recovery without invoking the LLM. The LLM is reserved exclusively for cases where no feasible path exists, enabling goal demotion or escalation. Prior graph-based tool-use systems (ControlLLM, ToolNet, NaviAgent) focus on tool selection and planning; our contribution is runtime fault tolerance with deterministic recovery and binary observability -- every failure is either a logged reroute or an explicit escalation, never a silent skip. Across 19 scenarios spanning three graph topologies (linear pipeline, dependency DAG, parallel fan-out), Self-Healing Router matches ReAct's correctness while reducing control-plane LLM calls by 93% (9 vs 123 aggregate) and eliminating the silent-failure cases observed in a well-engineered static workflow baseline under compound failures.
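The reroute-on-failure mechanism described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the graph layout (a dict of dicts), the function names `shortest_path`, `mark_failed`, and `route`, and the tool names are all hypothetical, and the health monitors and priority scores are omitted. It only shows the core idea: a failed tool's edges are set to infinity, Dijkstra's algorithm is rerun, and the outcome is either a logged reroute or an explicit escalation.

```python
import heapq
import math


def shortest_path(graph, source, target):
    """Dijkstra's algorithm. Returns (cost, path), or (inf, []) if unreachable."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for nxt, weight in graph.get(node, {}).items():
            if math.isinf(weight):
                continue  # edge belongs to a failed tool; treat as unusable
            candidate = d + weight
            if candidate < dist.get(nxt, math.inf):
                dist[nxt] = candidate
                prev[nxt] = node
                heapq.heappush(heap, (candidate, nxt))
    if target not in dist:
        return math.inf, []
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


def mark_failed(graph, tool):
    """Reweight every edge into or out of `tool` to infinity (in place)."""
    for src, edges in graph.items():
        for dst in edges:
            if src == tool or dst == tool:
                edges[dst] = math.inf


def route(graph, source, target, failed_tool, log):
    """Handle one tool failure: reroute if a path exists, else escalate."""
    mark_failed(graph, failed_tool)
    cost, path = shortest_path(graph, source, target)
    if path:
        log.append(("reroute", failed_tool, path))
        return path
    # No feasible path: this is the case the abstract reserves for the LLM.
    log.append(("escalate", failed_tool, None))
    return None


if __name__ == "__main__":
    # Hypothetical tool graph: a cheap primary tool and a costlier backup.
    graph = {
        "start": {"primary_api": 1.0, "backup_api": 3.0},
        "primary_api": {"done": 1.0},
        "backup_api": {"done": 1.0},
        "done": {},
    }
    log = []
    print(shortest_path(graph, "start", "done"))
    print(route(graph, "start", "done", "primary_api", log))
    print(route(graph, "start", "done", "backup_api", log))
    print(log)
```

In this sketch every failure appends exactly one log entry, either a reroute or an escalation, which mirrors the "binary observability" property the abstract claims; how the actual system records these events is not specified in the source.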