Tool Attention is introduced as a middleware mechanism to reduce the overhead of the Model Context Protocol (MCP) in LLM agents, which suffers from a "Tools Tax" due to eager schema injection. It uses gated attention over tools, incorporating Intent Schema Overlap (ISO) scores, state-aware gating, and lazy schema loading to keep a compact summary pool in context. Evaluated on a simulated benchmark, Tool Attention reduces per-turn tool tokens by 95.0% and raises effective context utilization from 24% to 91%.
LLM agents are wasting up to 60k tokens per turn on unnecessary tool schemas – Tool Attention slashes this "Tools Tax" by 95% and unlocks truly scalable agentic workflows.
The Model Context Protocol (MCP) has become a common interface for connecting large language model (LLM) agents to external tools, but its reliance on stateless, eager schema injection imposes a hidden per-turn overhead (the MCP Tax or Tools Tax) that practitioner reports place between roughly 10k and 60k tokens in typical multi-server deployments. This payload inflates the key-value cache, is associated with reasoning degradation as context utilization approaches published fracture points around 70%, and turns token budgets into a recurring operational cost. We introduce Tool Attention, a middleware-layer mechanism that generalizes the "Attention Is All You Need" paradigm from self-attention over tokens to gated attention over tools. Tool Attention combines (i) an Intent Schema Overlap (ISO) score from sentence embeddings, (ii) a state-aware gating function enforcing preconditions and access scopes, and (iii) a two-phase lazy schema loader that keeps a compact summary pool in context and promotes full JSON schemas only for top-k gated tools. We evaluate on a simulated 120-tool, six-server benchmark whose per-server token counts are calibrated to public audits of real MCP deployments. In this simulation, Tool Attention directly reduces measured per-turn tool tokens by 95.0% (47.3k -> 2.4k) and raises effective context utilization (a token-ratio quantity) from 24% to 91%. End-to-end figures for task success, latency, cost, and reasoning quality are reported as projections derived from the measured token counts combined with published deployment telemetry; they are not measured on live LLM agents, and we mark projected values explicitly throughout. Taken together, the results support a simple thesis: protocol-level efficiency, not raw context length, is a binding constraint on scalable agentic systems. The code for this work is accessible at https://github.com/asadani/tool-attention
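The three components named in the abstract (ISO scoring, state-aware gating, and two-phase lazy schema loading) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Tool` class, the function names, the example tools, and the choice to keep every tool's summary in the pool are hypothetical, and a toy bag-of-words vector stands in for the sentence embeddings the abstract describes.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Tool:
    # Hypothetical record; field names are illustrative, not from the paper.
    name: str
    summary: str                 # short description kept in context every turn
    full_schema: dict            # full JSON schema, injected only when promoted
    required_scope: str = "public"
    preconditions: set = field(default_factory=set)


def embed(text):
    """Toy bag-of-words vector; a stand-in for a sentence-embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def gate(tool, scopes, state):
    """State-aware gate: the caller holds the scope and all preconditions hold."""
    return tool.required_scope in scopes and tool.preconditions <= state


def build_context(intent, tools, scopes, state, k=2):
    """Phase 1: keep summaries for all tools. Phase 2: promote top-k gated tools."""
    query = embed(intent)
    scored = [
        (cosine(query, embed(t.summary)), t) for t in tools if gate(t, scopes, state)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    promoted = [t for score, t in scored[:k] if score > 0]
    return {
        "summaries": [f"{t.name}: {t.summary}" for t in tools],
        "schemas": {t.name: t.full_schema for t in promoted},
    }


# Illustrative tool registry (all entries are made up for this example).
TOOLS = [
    Tool("read_file", "read the contents of a file from disk",
         {"type": "object", "properties": {"path": {"type": "string"}}}, "fs"),
    Tool("send_email", "send an email message to a recipient",
         {"type": "object", "properties": {"to": {"type": "string"}}}, "mail"),
    Tool("commit_changes", "commit staged file changes to the git repository",
         {"type": "object", "properties": {"message": {"type": "string"}}},
         "git", {"repo_open"}),
    Tool("search_web", "search the web for pages matching a query",
         {"type": "object", "properties": {"query": {"type": "string"}}}),
]

context = build_context(
    "read the config file from disk", TOOLS,
    scopes={"fs", "git", "public"}, state=set(), k=1,
)
```

In this sketch, `send_email` is gated out by scope and `commit_changes` by its unmet `repo_open` precondition, so only the best-matching remaining tool has its full schema promoted, while the summary pool still lists all four tools.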