The paper introduces SKILLS, a benchmark framework for evaluating whether LLM agents can execute telecommunications operations workflows through real API interfaces. The framework comprises 37 scenarios across 8 TM Forum Open API domains, backed by mock API servers and deterministic evaluation rubrics. Results show that augmenting an agent with a SKILL.md document encoding domain knowledge improves performance for every open-weight model tested, with MiniMax M2.5 achieving the highest with-skill score (81.1%).
Injecting structured domain knowledge into LLM agents raises their scores on telecommunications operations workflows by up to 18.9 percentage points.
As telecommunications operators accelerate adoption of AI-enabled automation, a practical question remains unresolved: can general-purpose large language model (LLM) agents reliably execute telecom operations workflows through real API interfaces, or do they require structured domain guidance? We introduce SKILLS (Structured Knowledge Injection for LLM-driven Service Lifecycle operations), a benchmark framework comprising 37 telecom operations scenarios spanning 8 TM Forum Open API domains (TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724). Each scenario is grounded in live mock API servers with seeded production-representative data, MCP tool interfaces, and deterministic evaluation rubrics combining response content checks, tool-call verification, and database state assertions. We evaluate open-weight models under two conditions: baseline (generic agent with tool access but no domain guidance) and with-skill (agent augmented with a portable SKILL.md document encoding workflow logic, API patterns, and business rules). Results across 5 open-weight model conditions and 185 scenario-runs show consistent skill lift across all models. MiniMax M2.5 leads (81.1% with-skill, +13.5pp), followed by Nemotron 120B (78.4%, +18.9pp), GLM-5 Turbo (78.4%, +5.4pp), and Seed 2.0 Lite (75.7%, +18.9pp).
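The three-part rubric and the percentage-point lift described above can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the SKILLS codebase: the `RunRecord` and `Rubric` classes, the tool name `update_ticket`, and the ticket fields are illustrative only. The sketch also assumes each scenario is scored as a single pass/fail, which the abstract does not state. The counts 30 and 25 are not reported in the source; they are simply whole numbers out of 37 that round to the reported 81.1% and +13.5pp for MiniMax M2.5.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """What one agent run leaves behind (hypothetical structure)."""
    response_text: str
    tool_calls: list          # names of the tools the agent invoked
    db_state: dict            # fields read back from the mock server afterwards


@dataclass
class Rubric:
    """Deterministic rubric combining the three check types from the abstract."""
    required_phrases: list = field(default_factory=list)   # response content checks
    required_tools: list = field(default_factory=list)     # tool-call verification
    expected_state: dict = field(default_factory=dict)     # database state assertions

    def passes(self, run: RunRecord) -> bool:
        text = run.response_text.lower()
        content_ok = all(p.lower() in text for p in self.required_phrases)
        tools_ok = all(t in run.tool_calls for t in self.required_tools)
        state_ok = all(run.db_state.get(k) == v
                       for k, v in self.expected_state.items())
        # A run passes only if every check of every type succeeds.
        return content_ok and tools_ok and state_ok


def pass_rate(results: list) -> float:
    """Share of passing scenario-runs, as a percentage."""
    return 100.0 * sum(results) / len(results)


# Hypothetical scenario: the agent must resolve a ticket and say so.
rubric = Rubric(
    required_phrases=["resolved"],
    required_tools=["update_ticket"],
    expected_state={"status": "resolved"},
)
good_run = RunRecord(
    response_text="Ticket 42 has been resolved.",
    tool_calls=["get_ticket", "update_ticket"],
    db_state={"status": "resolved"},
)
# Claims success in text but never called the tool, so the state is unchanged.
bad_run = RunRecord(
    response_text="Ticket 42 has been resolved.",
    tool_calls=["get_ticket"],
    db_state={"status": "open"},
)

# Illustrative counts over 37 scenarios (an assumption, see the lead-in).
with_skill = [True] * 30 + [False] * 7
baseline = [True] * 25 + [False] * 12
lift_pp = pass_rate(with_skill) - pass_rate(baseline)

print(rubric.passes(good_run), rubric.passes(bad_run))
print(round(pass_rate(with_skill), 1), round(lift_pp, 1))
```

Requiring all three check types to agree is what makes a rubric of this kind hard to satisfy with a plausible-sounding answer alone: `bad_run` reports success in its response text but fails both the tool-call and the state checks.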