Microsoft Research | Imperial | Feb 18, 2026 | arXiv:2602.16485

Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao

AI Summary

The paper introduces Team-of-Thoughts, a multi-agent system (MAS) architecture in which an orchestrator dynamically calls heterogeneous tool agents, addressing the limitations of static, homogeneous MAS configurations. Two mechanisms drive performance: an orchestrator calibration scheme that identifies models with superior coordination capabilities, and a self-assessment protocol in which tool agents profile their own domain expertise. On reasoning and code generation benchmarks, Team-of-Thoughts outperforms homogeneous role-play baselines, reaching 96.67% on AIME24 and 72.53% on LiveCodeBench versus 80% and 65.93%, respectively.

Key Contribution

Forget static, homogeneous multi-agent systems: Team-of-Thoughts unlocks superior performance by dynamically orchestrating heterogeneous agents based on calibrated coordination and self-assessed domain expertise.

Abstract

Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models. To address this, we introduce Team-of-Thoughts, a novel MAS architecture that leverages the complementary capabilities of heterogeneous agents via an orchestrator-tool paradigm. Our framework introduces two key mechanisms to optimize performance: (1) an orchestrator calibration scheme that identifies models with superior coordination capabilities, and (2) a self-assessment protocol where tool agents profile their own domain expertise to account for variations in post-training skills. During inference, the orchestrator dynamically activates the most suitable tool agents based on these proficiency profiles. Experiments on five reasoning and code generation benchmarks show that Team-of-Thoughts delivers consistently superior task performance. Notably, on AIME24 and LiveCodeBench, our approach achieves accuracies of 96.67% and 72.53%, respectively, substantially outperforming homogeneous role-play baselines, which score 80% and 65.93%.
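The inference-time step described in the abstract, where the orchestrator activates the most suitable tool agents based on their proficiency profiles, can be illustrated with a minimal sketch. This is not the authors' implementation: the `ToolAgent` class, the `select_agents` function, the 0-to-1 score scale, the `top_k` and `min_score` parameters, and the agent names and scores are all hypothetical, chosen only to show one way profile-based selection could work.

```python
from dataclasses import dataclass, field


@dataclass
class ToolAgent:
    """Hypothetical tool agent with a self-assessed proficiency profile."""
    name: str
    # Maps a domain label to a self-assessed score in [0, 1] (assumed scale).
    profile: dict[str, float] = field(default_factory=dict)


def select_agents(agents, domain, top_k=2, min_score=0.5):
    """Return up to top_k agent names for a domain, best score first.

    Agents whose self-assessed score for the domain is below min_score
    (or missing) are skipped. Ties are broken alphabetically by name.
    """
    scored = [(agent.profile.get(domain, 0.0), agent.name) for agent in agents]
    eligible = [(score, name) for score, name in scored if score >= min_score]
    eligible.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in eligible[:top_k]]


# Illustrative profiles only; these numbers are not from the paper.
agents = [
    ToolAgent("agent_a", {"math": 0.9, "code": 0.4}),
    ToolAgent("agent_b", {"math": 0.6, "code": 0.8}),
    ToolAgent("agent_c", {"math": 0.3, "code": 0.7}),
]

print(select_agents(agents, "math"))
print(select_agents(agents, "code", top_k=1))
```

In this sketch a threshold keeps weakly qualified agents out of the call set, while `top_k` caps how many agents are activated per query. The paper's actual selection rule may differ.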

Eval Frameworks & Benchmarks | Reasoning & Chain-of-Thought | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 19

Year: 2026

Venue: N/A
