Search papers, labs, and topics across Lattice.
Heavy supervised fine-tuning on a specific domain can severely degrade general tool-use capabilities in large language models, even to near-zero performance. However, fine-tuning a domain-specialized model on a small dataset of agentic traces from a *different* domain can reactivate these dormant tool-use skills. Experiments with Goedel-Prover-V2 show that just 100 Lean-specific agentic traces restore strong tool-calling behavior, improving performance on the Berkeley Function Calling Leaderboard from near zero to 83.8% and ProofNet from 21.51% to 25.81%.
Domain-specific fine-tuning can induce "agentic collapse" in LLMs, but a surprisingly small amount of agentic data from *another* domain can bring those general tool-use skills roaring back.
Heavy supervised fine-tuning on a target domain can strongly suppress capabilities that were present in the base model. We study this phenomenon in formal mathematics using Goedel-Prover-V2, an open-source model heavily trained on 1.8 million formal-math examples. After domain specialization, the model almost completely loses its ability to produce valid tool calls, even when explicitly instructed to use tools, dropping from 89.4% function-calling accuracy in the base model to nearly 0%. We ask whether this agentic collapse is permanent or instead reversible. To answer this question, we fine-tune the specialized model on a small amount of Lean-specific tool-use data. Remarkably, as few as 100 agentic traces are sufficient to restore strong tool-calling behavior. Importantly, this recovery is not the result of reward hacking or benchmark-specific optimization: the recovery data is entirely drawn from the Lean setting, where the model uses natural-language queries to search the Mathlib library for relevant theorems and lemmas, yet the regained capability transfers well beyond that domain. In particular, these same 100 Lean-specific traces improve performance on the Berkeley Function Calling Leaderboard from near zero to 83.8%, approaching the base model's 89.4% despite the mismatch in task distribution and protocol. The recovered capability is also practically useful in-domain. On ProofNet, pass@32 improves from 21.51% to 25.81%. Together, these results show that heavy domain supervised fine-tuning can suppress general tool-use ability without permanently erasing it, and that a small amount of domain-specific agentic data can awaken dormant tool-use capabilities.