This paper introduces SkillResolve-Bench, a benchmark designed to evaluate and mitigate same-capability ambiguity in agent skill retrieval, where a retrieved skill may inadvertently lead to risky execution paths. The study reveals that existing retrieval methods can expose agents to harmful siblings (skills that share a capability family but are unsuitable for specific queries), highlighting the need for a more nuanced approach to skill selection. SkillResolve, the proposed method, significantly improves retrieval performance, achieving a Recall@3 of 0.766 and an NDCG@3 of 0.699 while eliminating harmful sibling exposure, demonstrating the importance of representative selection in enhancing safety in skill retrieval.
Resolving same-capability ambiguity improves agent safety: SkillResolve lowers the harmful sibling rate at K=3 (HSR@3) from 0.693 under SkillRouter to 0.
Agent skill libraries are becoming routable software assets: a retrieved skill can contribute instructions, scripts, resource bindings, and execution assumptions to an agent. This makes skill retrieval more than broad relevance matching. A retriever can find the right capability family yet expose the wrong same-capability representative. We study this failure as same-capability execution-risk retrieval. Each query pairs a helpful skill with a query-specific risky sibling that shares the capability family but can lead execution toward a stale resource, missing precondition, or wrong procedure. We introduce SkillResolve-Bench 1.0, an auditable benchmark for this setting with 661 helpful/risky pairs, source-role and admission evidence, cue/leakage checks, query-disjoint splits, and a 7,982-candidate pool that includes 6,660 public SkillRet candidates. The benchmark reports helpful ranking together with harmful sibling rate (HSR@K), the top-K exposure of the risky sibling. We also provide SkillResolve, a reference method that resolves active candidate families, scores query-conditioned utility from confusable library negatives and contract-profile cues, and selects one representative from each family before the final top-K list. Under the released family relation, SkillResolve reaches Recall@3 0.766 and NDCG@3 0.699 while keeping HSR@3=0. It improves over SkillRouter by 0.112 Recall@3 and 0.165 NDCG@3 while reducing HSR@3 from 0.693 to 0. Without representative selection, HSR@3 rises to 0.236 under the same scorer, identifying within-family representative choice as the mechanism that turns capability retrieval into safer procedural exposure.
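The metric and the selection step described in the abstract can be illustrated with a minimal Python sketch. This is not the released SkillResolve implementation. It assumes that HSR@K is a per-query indicator (the risky sibling appears in the top K) averaged over queries, and that representative selection keeps the single highest-scoring skill per family. The skill names, scores, and the `family_of` mapping are hypothetical toy data.

```python
from typing import Dict, List, Sequence, Tuple


def recall_at_k(ranked: Sequence[str], helpful: str, k: int) -> float:
    """Per-query indicator: 1.0 if the helpful skill is in the top-k list."""
    return 1.0 if helpful in ranked[:k] else 0.0


def hsr_at_k(ranked: Sequence[str], risky: str, k: int) -> float:
    """Per-query indicator: 1.0 if the risky sibling is in the top-k list.

    Assumption: the benchmark-level HSR@K is the mean of this value over queries.
    """
    return 1.0 if risky in ranked[:k] else 0.0


def select_representatives(
    scored: Sequence[Tuple[str, float]],
    family_of: Dict[str, str],
    k: int,
) -> List[str]:
    """Keep the highest-scoring skill in each family, then return the top-k."""
    best: Dict[str, Tuple[str, float]] = {}
    for skill, score in scored:
        family = family_of[skill]
        if family not in best or score > best[family][1]:
            best[family] = (skill, score)
    ranked = sorted(best.values(), key=lambda item: item[1], reverse=True)
    return [skill for skill, _ in ranked[:k]]


# Hypothetical candidates for one query: (skill, query-conditioned score).
scored = [
    ("deploy_v2", 0.91),        # helpful skill
    ("deploy_v1_stale", 0.88),  # risky sibling in the same family
    ("rollback", 0.60),
    ("lint", 0.40),
]
family_of = {
    "deploy_v2": "deploy",
    "deploy_v1_stale": "deploy",
    "rollback": "rollback",
    "lint": "lint",
}

# Plain top-3 by score exposes the risky sibling.
plain_top3 = [s for s, _ in sorted(scored, key=lambda x: x[1], reverse=True)[:3]]

# One representative per family removes it from the list.
resolved_top3 = select_representatives(scored, family_of, k=3)

print(plain_top3, hsr_at_k(plain_top3, "deploy_v1_stale", 3))
print(resolved_top3, hsr_at_k(resolved_top3, "deploy_v1_stale", 3))
```

In this toy case the helpful skill stays at rank 1 in both lists, while the risky sibling appears only in the plain top-3. This mirrors the abstract's ablation, where removing representative selection raises HSR@3 under the same scorer. The sketch also shows the cost of the design: if the scorer ranks the risky sibling above the helpful skill, per-family selection keeps the wrong representative, so the approach depends on the quality of the query-conditioned scores.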