The paper introduces SpecFed, a federated inference framework for LLMs that combines speculative decoding with compressed transmission of top-K token probabilities to accelerate distributed LLM inference. SpecFed reduces communication overhead by transmitting only the top-K token probabilities from each worker and employs server-side reconstruction strategies to approximate the full probability distribution. Empirical results demonstrate that SpecFed maintains high generation fidelity while significantly improving decoding throughput in federated settings.
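To make the compression step concrete, here is a minimal sketch of top-K transmission and one possible server-side reconstruction. The summary does not say which reconstruction strategies SpecFed uses, so the uniform spreading of the untransmitted mass shown here is an assumption for illustration, and the function names `top_k_compress` and `reconstruct_uniform` are hypothetical.

```python
from typing import Dict, List


def top_k_compress(probs: List[float], k: int) -> Dict[int, float]:
    """Worker side: keep only the k largest probabilities, keyed by token id."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return {i: probs[i] for i in ranked[:k]}


def reconstruct_uniform(sparse: Dict[int, float], vocab_size: int) -> List[float]:
    """Server side (assumed strategy, not confirmed by the paper summary):
    spread the untransmitted probability mass evenly over the unsent tokens."""
    residual = max(0.0, 1.0 - sum(sparse.values()))
    missing = vocab_size - len(sparse)
    fill = residual / missing if missing > 0 else 0.0
    return [sparse.get(i, fill) for i in range(vocab_size)]


# Toy distribution over a 6-token vocabulary.
p = [0.5, 0.2, 0.15, 0.1, 0.03, 0.02]
sent = top_k_compress(p, k=2)                         # only 2 of 6 entries are sent
p_hat = reconstruct_uniform(sent, vocab_size=len(p))  # approximate full distribution
```

With `k=2`, the worker sends two (token id, probability) pairs instead of six probabilities. The reconstructed `p_hat` still sums to one but flattens the tail, which is the kind of local reconstruction error the paper bounds.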
Federated LLM inference gets a speed boost: SpecFed's speculative decoding and compressed communication slash latency without sacrificing generation quality.
Federated inference enhances LLM performance in edge computing through weighted averaging of distributed model predictions. However, autoregressive LLM inference requires frequent full-model forward passes across workers, severely limiting decoding throughput. Distributed deployment aggravates the problem by adding a communication bottleneck: each worker must transmit a full token probability distribution for every draft token, and this traffic dominates end-to-end latency. To address these challenges, we introduce speculative decoding to enable parallel LLM processing and propose a top-K compressed transmission scheme with two server-side reconstruction strategies. We theoretically analyze the robustness of our method in terms of local reconstruction error, aggregation bias, and acceptance-rate bias, and derive corresponding bounds. Experiments demonstrate that our scheme achieves high generation fidelity while significantly reducing communication overhead.
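The interaction between aggregation and speculative decoding can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the server forms a weighted average of per-worker distributions, then applies the standard speculative-decoding acceptance rule, accepting a draft token with probability min(1, p/q), where p is the aggregated target probability and q is the draft probability. The names `weighted_average` and `accept_draft`, the weights, and the toy distributions are all hypothetical.

```python
import random
from typing import List


def weighted_average(dists: List[List[float]], weights: List[float]) -> List[float]:
    """Server side: combine per-worker distributions using normalized weights."""
    total = sum(weights)
    vocab = len(dists[0])
    return [sum(w * d[i] for w, d in zip(weights, dists)) / total for i in range(vocab)]


def accept_draft(token: int, target: List[float], draft: List[float],
                 rng: random.Random) -> bool:
    """Standard speculative-decoding test: accept with probability min(1, p/q)."""
    q = draft[token]
    if q <= 0.0:
        return False  # guard: the draft model could not have proposed this token
    return rng.random() < min(1.0, target[token] / q)


# Two hypothetical workers over a 3-token vocabulary, weighted 3:1.
workers = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
target = weighted_average(workers, weights=[3.0, 1.0])  # about [0.5, 0.35, 0.15]
draft = [0.25, 0.5, 0.25]

rng = random.Random(0)
always = accept_draft(0, target, draft, rng)     # p/q is about 2, so always accepted
sometimes = accept_draft(1, target, draft, rng)  # accepted with probability about 0.7
```

Because acceptance depends on the ratio p/q, any error in the aggregated `target` (for example from top-K reconstruction) shifts the acceptance rate, which is why the abstract analyzes aggregation bias and acceptance-rate bias together.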