Pengcheng Laboratory · Apr 28, 2026 · arXiv:2604.25777

SpecFed: Accelerating Federated LLM Inference with Speculative Decoding and Compressed Transmission

Ce Zheng, Xinghan Wang, Jiahong Ning, Yuxuan Shi, Ningjing Huang, Ning Huang, Tingting Yang

AI Summary

The paper introduces SpecFed, a federated inference framework for LLMs that combines speculative decoding with compressed transmission of top-K token probabilities to accelerate distributed LLM inference. Instead of sending full token probability distributions, each worker transmits only its top-K token probabilities. The server then approximates the full distribution using one of two reconstruction strategies, which reduces communication overhead. Empirical results show that SpecFed maintains high generation fidelity while significantly improving decoding throughput in federated settings.
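The summary does not describe how the two reconstruction strategies work. The following is a minimal Python sketch of top-K compression and reconstruction. It assumes each worker holds one next-token distribution given as a list of probabilities over a vocabulary of size V. The function names and both strategies (renormalizing the kept mass, or spreading the missing mass uniformly over the untransmitted tokens) are hypothetical stand-ins for illustration, not necessarily the paper's methods. Under this representation a worker sends K (token id, probability) pairs instead of V probabilities.

```python
import heapq
from typing import List, Sequence, Tuple

def compress_top_k(probs: Sequence[float], k: int) -> Tuple[List[int], List[float]]:
    """Worker side: keep only the k most likely token ids and their probabilities."""
    # nlargest returns the ids ordered from most to least likely
    top = heapq.nlargest(k, range(len(probs)), key=lambda i: probs[i])
    return top, [probs[i] for i in top]

def reconstruct_renormalize(ids: Sequence[int], vals: Sequence[float],
                            vocab_size: int) -> List[float]:
    """Hypothetical strategy A: zero the tail and rescale the kept mass to sum to 1."""
    total = sum(vals)
    full = [0.0] * vocab_size
    for i, v in zip(ids, vals):
        full[i] = v / total
    return full

def reconstruct_uniform_tail(ids: Sequence[int], vals: Sequence[float],
                             vocab_size: int) -> List[float]:
    """Hypothetical strategy B: keep the top-K values as sent and spread the
    leftover mass evenly over the tokens that were not transmitted."""
    full = [0.0] * vocab_size
    for i, v in zip(ids, vals):
        full[i] = v
    tail = vocab_size - len(ids)
    if tail > 0:
        residual = max(0.0, 1.0 - sum(vals))  # probability mass lost by truncation
        kept = set(ids)
        for j in range(vocab_size):
            if j not in kept:
                full[j] = residual / tail
    return full
```

For example, with probs = [0.5, 0.2, 0.15, 0.1, 0.05] and K = 2, the worker sends ids [0, 1] and values [0.5, 0.2]. Strategy A yields roughly [0.714, 0.286, 0, 0, 0], and strategy B yields approximately [0.5, 0.2, 0.1, 0.1, 0.1]. Strategy A keeps the relative shape of the head but gives every tail token zero probability. Strategy B preserves the transmitted values exactly at the cost of a flat tail.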

Key Contribution

Federated LLM inference gets a speed boost: SpecFed's speculative decoding and compressed communication slash latency without sacrificing generation quality.

Abstract

Federated inference enhances LLM performance in edge computing through weighted averaging of distributed model predictions. However, autoregressive LLM inference requires frequent full-model forward passes across workers, severely limiting decoding throughput. Distributed deployment further aggravates this due to a communication bottleneck: each worker must transmit full token probability distributions per draft token, dominating end-to-end latency. To address these challenges, we introduce speculative decoding to enable parallel LLM processing and propose a top-K compressed transmission scheme with two server-side reconstruction strategies. We theoretically analyze the robustness of our method in terms of local reconstruction error, aggregation bias, and acceptance-rate bias, and derive corresponding bounds. Experiments demonstrate that our scheme achieves high generation fidelity while significantly reducing communication overhead.
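The abstract combines two standard pieces: a weighted average of worker predictions and the speculative-decoding check that decides whether a draft token is kept. Below is a minimal Python sketch of both. It assumes that the target distribution used for verification is the weighted average of the workers' (possibly reconstructed) distributions, and that each draft token was sampled from a separate draft distribution q. The abstract does not say where drafts come from or how weights are chosen, and all function names here are hypothetical. The acceptance rule is the standard speculative-sampling test: accept token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(0, p - q). This test preserves the target distribution p.

```python
import random
from typing import List, Sequence, Tuple

def aggregate(worker_dists: Sequence[Sequence[float]],
              weights: Sequence[float]) -> List[float]:
    """Server side: weighted average of per-worker next-token distributions."""
    z = sum(weights)  # normalize so the result is a proper distribution
    vocab = len(worker_dists[0])
    return [sum(w * d[t] for w, d in zip(weights, worker_dists)) / z
            for t in range(vocab)]

def expected_acceptance(p: Sequence[float], q: Sequence[float]) -> float:
    """Probability that a single draft token is accepted:
    sum_x min(p(x), q(x)), which equals 1 - TV(p, q)."""
    return sum(min(pt, qt) for pt, qt in zip(p, q))

def verify_draft_token(token: int, p: Sequence[float], q: Sequence[float],
                       rng: random.Random) -> Tuple[bool, int]:
    """Standard speculative-sampling check for one draft token.

    p: aggregated target distribution.
    q: draft distribution that `token` was sampled from (so q[token] > 0).
    Returns (accepted, token_to_emit)."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return True, token
    # Rejected: resample from max(0, p - q); choices() normalizes the weights
    residual = [max(0.0, pt - qt) for pt, qt in zip(p, q)]
    if sum(residual) == 0.0:  # only possible when p == q; sample p itself
        residual = list(p)
    return False, rng.choices(range(len(residual)), weights=residual, k=1)[0]
```

For instance, averaging [0.75, 0.25, 0.0] and [0.25, 0.25, 0.5] with equal weights gives p = [0.5, 0.25, 0.25]. Against a draft q = [0.25, 0.25, 0.5], expected_acceptance(p, q) is 0.75. Token 0 is always accepted, and token 2 is accepted half the time, with every rejection resampled to token 0. Because the acceptance rate depends on p through min(p, q), errors in a reconstructed or aggregated p shift the acceptance rate. This is one way to read the abstract's "acceptance-rate bias".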

Topics: Distributed Systems & Hardware; Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 25

Year: 2026

Venue: N/A
