Apr 23, 2026 · arXiv:2604.21816

Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

AI Summary

Tool Attention is introduced as a middleware mechanism to reduce the overhead that the Model Context Protocol (MCP) imposes on LLM agents: a "Tools Tax" caused by eager schema injection. It applies gated attention over tools, combining Intent Schema Overlap (ISO) scores, state-aware gating, and lazy schema loading, so that only a compact summary pool stays in context. Evaluated on a simulated benchmark, Tool Attention reduces per-turn tool tokens by 95.0% and raises effective context utilization from 24% to 91%.

Key Contribution

LLM agents are wasting up to 60k tokens per turn on unnecessary tool schemas – Tool Attention slashes this "Tools Tax" by 95% and unlocks truly scalable agentic workflows.

Abstract

The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead (the MCP Tax or Tools Tax) that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k → 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems. The code for this work is accessible at https://github.com/asadani/tool-attention.
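The three components named in the abstract (an overlap score, a state-aware gate, and a lazy schema loader) can be illustrated with a small sketch. This is not the authors' implementation: the names `Tool`, `embed`, `gate`, `select_tools`, and `build_context` are hypothetical, a toy bag-of-words vector stands in for real sentence embeddings, and keeping every tool's summary in the pool (including gated-out tools) is an assumption made for simplicity.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    summary: str  # short description that stays in context
    required_scopes: set = field(default_factory=set)
    preconditions: set = field(default_factory=set)


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a sentence embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def gate(tool, state_flags, granted_scopes):
    """State-aware gate: all preconditions and access scopes must be satisfied."""
    return (tool.preconditions <= state_flags
            and tool.required_scopes <= granted_scopes)


def select_tools(intent, tools, state_flags, granted_scopes, k=2):
    """Score gated tools against the intent and keep the top-k with nonzero overlap."""
    query = embed(intent)
    scored = [(cosine(query, embed(t.summary)), t)
              for t in tools if gate(t, state_flags, granted_scopes)]
    scored = [(s, t) for s, t in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]


def build_context(intent, tools, schema_store, state_flags, granted_scopes, k=2):
    """Two-phase loading: summaries for every tool, full schemas only for top-k."""
    selected = select_tools(intent, tools, state_flags, granted_scopes, k)
    return {
        "summaries": [f"{t.name}: {t.summary}" for t in tools],
        # schema_store maps a tool name to a loader callable, invoked lazily here
        "schemas": {t.name: schema_store[t.name]() for t in selected},
    }


# Hypothetical tools and schemas, for illustration only.
demo_tools = [
    Tool("read_file", "read a file from disk", required_scopes={"fs:read"}),
    Tool("delete_file", "delete a file from disk", required_scopes={"fs:write"}),
    Tool("send_email", "send an email message", preconditions={"authenticated"}),
]
demo_store = {t.name: (lambda n=t.name: {"name": n, "type": "object"})
              for t in demo_tools}
demo_context = build_context("read the file config from disk", demo_tools,
                             demo_store, state_flags=set(),
                             granted_scopes={"fs:read"}, k=1)
```

In this sketch, `delete_file` and `send_email` never have their schemas loaded because the gate rejects them before scoring, which is the property that keeps the per-turn payload small.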

Architecture Design (Transformers, SSMs, MoE) · Inference & Quantization · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 37

Year: 2026

Venue: N/A
