The paper introduces ToolSelect, a method for adaptively selecting the best task-specialized model from a pool of candidates in agentic healthcare systems. Because different models excel on different data samples, ToolSelect learns the selection by minimizing a population risk over sampled specialist tools, using a surrogate of the task-conditional selection loss. The authors evaluate it in a newly introduced Chest X-ray environment with diverse task-specialized models and an accompanying benchmark, ToolSelectBench, and report that it outperforms 10 state-of-the-art methods across disease detection, report generation, visual grounding, and VQA tasks.
Stop relying on a single "best" model: ToolSelect adaptively picks the right specialist model for each healthcare query, outperforming SOTA methods by learning from model behavior.
Task-specialized models form the backbone of agentic healthcare systems, enabling agents to answer clinical queries across tasks such as disease diagnosis, localization, and report generation. Yet, for a given task, a single "best" model rarely exists. In practice, each task is better served by multiple competing specialist models, with different models excelling on different data samples. As a result, for any given query, agents must reliably select the right specialist model from a heterogeneous pool of tool candidates. To this end, we introduce ToolSelect, which adaptively learns model selection for tools by minimizing a population risk over sampled specialist tool candidates using a consistent surrogate of the task-conditional selection loss. Concretely, we propose an Attentive Neural Process-based selector conditioned on the query and per-model behavioral summaries to choose among the specialist models. Motivated by the absence of any established testbed, we introduce the first agentic Chest X-ray environment, equipped with a diverse suite of task-specialized models (17 disease detection, 19 report generation, 6 visual grounding, and 13 VQA), and develop ToolSelectBench, a benchmark of 1448 queries. Our results demonstrate that ToolSelect consistently outperforms 10 SOTA methods across four different task families.
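To make the idea of per-query model selection concrete, the following is a minimal sketch and not the authors' implementation. It replaces the Attentive Neural Process with a plain bilinear scorer, and it uses a softmax cross-entropy against loss-derived soft targets as a stand-in surrogate; the abstract does not specify the actual surrogate. All names (`selector_probs`, `surrogate_loss`, `train_step`, `select`, `W`) and the shapes of the query embedding and behavioral summaries are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def selector_probs(query_emb, summaries, W):
    """Probability of picking each candidate model for one query.

    query_emb: (d_q,) embedding of the query (hypothetical encoder output).
    summaries: (n_models, d_s) one behavioral summary vector per model.
    W:         (d_s, d_q) learnable bilinear weights (assumed scorer).
    """
    scores = summaries @ (W @ query_emb)  # one score per candidate model
    return softmax(scores)

def surrogate_loss(probs, per_model_loss):
    """Cross-entropy against soft targets that favor low-loss models.

    This is an assumed stand-in, not the paper's surrogate.
    """
    targets = softmax(-per_model_loss)
    return float(-(targets * np.log(probs + 1e-12)).sum())

def train_step(query_emb, summaries, W, per_model_loss, lr=0.5):
    """One gradient-descent step on the surrogate for a single query."""
    probs = selector_probs(query_emb, summaries, W)
    targets = softmax(-per_model_loss)
    # d(loss)/d(scores) = probs - targets; scores are linear in W.
    grad_W = np.outer(summaries.T @ (probs - targets), query_emb)
    return W - lr * grad_W

def select(query_emb, summaries, W):
    """Index of the candidate model the selector would route the query to."""
    return int(np.argmax(selector_probs(query_emb, summaries, W)))
```

In use, `per_model_loss` would hold the task loss each candidate incurs on a training query, and repeated calls to `train_step` over sampled queries and sampled candidate pools would approximate minimizing a population risk. At inference time only `select` is needed, since the true per-model losses are unknown.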