The paper introduces SpecHop, a continuous speculation framework to reduce latency in multi-hop tool-use scenarios for LLMs by predicting tool observations. SpecHop maintains multiple speculative threads, asynchronously verifies predictions against actual tool outputs, and commits or rolls back threads to preserve accuracy. Experiments on retrieval-augmented multi-hop tasks show SpecHop achieves up to 40% latency reduction, closely matching theoretical predictions of optimal latency gain.
LLMs can cut multi-hop retrieval latency by up to 40% in some settings using SpecHop, a framework that speculatively executes multiple reasoning paths and rolls back incorrect ones.
Large language models increasingly use external tools such as web search and document retrieval to solve information-intensive tasks. However, multi-hop tool use in complex tasks introduces substantial latency, since the model must repeatedly wait for tool observations before continuing. We study how to accelerate such trajectories without changing the final trajectory the model would have taken without acceleration, assuming access to faster but less reliable speculator tools. We develop a theoretical framework for lossless speculation in multi-hop tool-use settings, characterizing the optimal achievable latency gain. We propose SpecHop, a continuous speculation framework that maintains multiple speculative threads, verifies predicted observations asynchronously as target tool outputs arrive, commits correct branches, and rolls back incorrect ones. This preserves accuracy while reducing wall-clock latency. We show that SpecHop can approach oracle latency gains with enough active threads. Empirically, on retrieval-augmented multi-hop tasks, SpecHop closely matches theoretical predictions and reduces latency by up to 40% in some settings. Code: https://github.com/mehrdadsaberi/spechop
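The commit-or-rollback idea described in the abstract can be illustrated with a small simulation. The sketch below is not taken from the SpecHop repository and is not the paper's theoretical model: it is a toy under stated assumptions, namely that model generation time is zero, that there is no limit on active speculative threads, and that latencies are fixed integers in milliseconds. All names (`next_query`, `target_tool`, `run_baseline`, `run_speculative`, `wrong_hops`) are hypothetical. It shows that the committed trajectory is identical to the sequential one, and that only hops where the speculator guessed correctly save time.

```python
def next_query(observations):
    """Hypothetical stand-in for the LLM: the next query depends on what was observed so far."""
    last = observations[-1] if observations else "start"
    return f"q{len(observations)}:{last}"


def target_tool(query):
    """Hypothetical slow but reliable tool (for example, a real retrieval call)."""
    return f"doc[{query}]"


def run_baseline(num_hops, target_ms):
    """Sequential tool use: wait for every target observation before continuing."""
    observations, clock = [], 0
    for _ in range(num_hops):
        observations.append(target_tool(next_query(observations)))
        clock += target_ms
    return observations, clock


def run_speculative(num_hops, target_ms, spec_ms, wrong_hops):
    """Toy lossless speculation with unlimited threads (an assumption of this sketch).

    `wrong_hops` is the set of hop indices where the fast speculator mispredicts.
    `clock` tracks when the correct query for the current hop becomes available.
    """
    committed, clock, finish = [], 0, 0
    for hop in range(num_hops):
        query = next_query(committed)
        truth = target_tool(query)                      # arrives at clock + target_ms
        guess = "wrong-guess" if hop in wrong_hops else truth  # arrives at clock + spec_ms
        verified_at = clock + target_ms
        if guess == truth:
            # Commit: work that continued from the guess is kept, so the next
            # hop effectively started as soon as the speculator answered.
            clock += spec_ms
        else:
            # Rollback: work built on the wrong guess is discarded and the
            # next hop restarts from the true observation.
            clock = verified_at
        committed.append(truth)                         # only verified observations are committed
        finish = max(finish, verified_at)               # every hop must be verified before finishing
    return committed, finish


base_traj, base_ms = run_baseline(num_hops=4, target_ms=1000)
spec_traj, spec_ms = run_speculative(num_hops=4, target_ms=1000, spec_ms=100, wrong_hops={1})
print(base_traj == spec_traj, base_ms, spec_ms)  # prints: True 4000 2200
```

In this toy model a speculator that is always wrong gives no speedup but also no slowdown, because each rollback resumes exactly where the sequential run would be. A bounded thread pool or non-zero model generation time would change the numbers, which is where an analysis of optimal achievable gain becomes necessary.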