The paper addresses the dilemma that task-oriented dialogue (TOD) systems struggle with new APIs while language agents (LAs) lack multi-turn conversation skills. To overcome this, the authors introduce CoALM (Conversational Agentic Language Model), a unified approach trained on a novel multi-task dataset, CoALM-IT, which interleaves multi-turn ReAct reasoning with complex API usage. Experiments show that CoALM models (8B, 70B, and 405B) outperform specialized models, including GPT-4o, on MultiWOZ 2.4, BFCL V3, and API-Bank benchmarks, demonstrating the effectiveness of a single model for both TOD and LA.
CoALM shows that a single LLM can handle both multi-turn conversation *and* complex tool use, outperforming top domain-specific models, including GPT-4o, on MultiWOZ 2.4, BFCL V3, and API-Bank.
Large Language Models (LLMs) with API-calling capabilities have enabled the building of effective Language Agents (LAs), while also revolutionizing the conventional task-oriented dialogue (TOD) paradigm. However, current approaches face a critical dilemma: TOD systems are often trained on a limited set of target APIs and require new data to maintain their quality when interfacing with new services, while LAs are not trained to maintain user intent over multi-turn conversations. Because both robust multi-turn management and advanced function calling are crucial for effective conversational agents, we evaluate these skills on three popular benchmarks: MultiWOZ 2.4 (TOD), BFCL V3 (LA), and API-Bank (LA). Our analyses reveal that specialized approaches excel in one domain but underperform in the other. To bridge this chasm, we introduce CoALM (Conversational Agentic Language Model), a unified approach that integrates both conversational and agentic capabilities. We created CoALM-IT, a carefully constructed multi-task dataset that interleaves multi-turn ReAct reasoning with complex API usage. Using CoALM-IT, we train three models, CoALM 8B, CoALM 70B, and CoALM 405B, which outperform top domain-specific models, including GPT-4o, across all three benchmarks. This demonstrates the feasibility of a single-model approach for both TOD and LA, setting a new standard for conversational agents.
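To make "multi-turn ReAct reasoning interleaved with API usage" concrete, the following is a minimal sketch of what one such training sample could look like. It is an illustration only: the `Turn` fields, the `render_dialogue` helper, the `find_hotel` and `book_hotel` API names, and the `User/Thought/Action/Observation/System` line format are all assumptions made here, not the actual CoALM-IT schema, which the abstract does not describe.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Turn:
    """One dialogue turn in a hypothetical ReAct-style sample."""
    user: str                          # user utterance
    thought: str                       # model's reasoning before acting
    response: str                      # final reply shown to the user
    api_name: Optional[str] = None     # hypothetical API to call, if any
    api_args: dict = field(default_factory=dict)
    observation: Optional[str] = None  # API result fed back to the model


def render_dialogue(turns: List[Turn]) -> str:
    """Flatten turns into text, interleaving reasoning and API calls."""
    lines = []
    for t in turns:
        lines.append(f"User: {t.user}")
        lines.append(f"Thought: {t.thought}")
        if t.api_name is not None:
            call = json.dumps(
                {"name": t.api_name, "arguments": t.api_args}, sort_keys=True
            )
            lines.append(f"Action: {call}")
            lines.append(f"Observation: {t.observation}")
        lines.append(f"System: {t.response}")
    return "\n".join(lines)


# Two turns: the second depends on the hotel chosen in the first,
# which is the kind of cross-turn intent tracking the abstract refers to.
example = [
    Turn(
        user="I need a cheap hotel in the north.",
        thought="Area and price range are given, so I can search.",
        api_name="find_hotel",
        api_args={"area": "north", "price": "cheap"},
        observation='[{"name": "Hypothetical Inn"}]',
        response="Hypothetical Inn is a cheap hotel in the north. Book it?",
    ),
    Turn(
        user="Yes, two nights from Friday.",
        thought="The hotel from the previous turn is still the target.",
        api_name="book_hotel",
        api_args={"name": "Hypothetical Inn", "stay": 2, "day": "friday"},
        observation='{"status": "ok"}',
        response="Your stay at Hypothetical Inn is booked.",
    ),
]

print(render_dialogue(example))
```

In this sketch each turn contributes a reasoning step, an optional API call with its result, and a user-facing reply, so a single sample exercises both dialogue-state tracking (the TOD skill) and function calling (the LA skill).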