This paper introduces a communication-efficient collaborative inference scheme for deploying large language models (LLMs) across LEO satellite networks by splitting the model into sub-models and distributing them across satellites. To minimize inference delay, the authors use pipeline parallelism to overlap sub-model inference with intermediate activation transmission, and they introduce an adaptive activation compression scheme to preserve accuracy. By jointly optimizing model splitting and compression ratios, the proposed scheme reduces inference delay by up to 42% and communication overhead by up to 71% compared to benchmarks, while maintaining inference accuracy.
Cut LLM inference delay on satellite networks by up to 42% by splitting the model across satellites and compressing intermediate activations for efficient communication.
Low Earth orbit (LEO) satellites play an essential role in intelligent Earth observation by leveraging artificial intelligence models. However, limited onboard memory and excessive inference delay prevent the practical deployment of large language models (LLMs) on a single satellite. In this paper, we propose a communication-efficient collaborative LLM inference scheme for LEO satellite networks. Specifically, the entire LLM is split into multiple sub-models, each deployed on a satellite, thereby enabling collaborative LLM inference by exchanging intermediate activations between satellites. The proposed scheme also leverages a pipeline parallelism mechanism that overlaps sub-model inference with intermediate activation transmission, thereby reducing LLM inference delay. An adaptive activation compression scheme is designed to mitigate cumulative errors from multi-stage model splitting while preserving inference accuracy. Furthermore, we formulate the LLM inference delay minimization problem by jointly optimizing model splitting and compression ratios under onboard memory and inference accuracy constraints. The problem is transformed into a shortest-path search problem over a directed acyclic graph whose edge weights explicitly quantify the inference delay induced by model splitting and compression strategies, and it is solved via a modified A*-based search algorithm. Extensive simulation results indicate that the proposed solution can reduce inference delay by up to 42% and communication overhead by up to 71% compared to state-of-the-art benchmarks, while keeping the inference accuracy loss below 1%.
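To make the shortest-path formulation more concrete, the following is a minimal sketch of how a joint split-and-compression search could look, under several assumptions that are not taken from the paper. It uses a plain textbook A* search, not the authors' modified variant. Delay is modeled sequentially (per-stage compute time plus activation transmission time), so pipeline overlap is not captured. The accuracy constraint is approximated by an integer "error cost" per compression level with a total budget, and a compression ratio is taken to mean the fraction of activation bytes kept. The function `split_search` and all of its parameters are hypothetical names chosen for illustration.

```python
import heapq
from itertools import count


def split_search(compute, memory, act_size, mem_cap, link_rate,
                 levels, err_budget, max_stages):
    """Textbook A* over states (next_layer, stages_used, error_used).

    compute[i], memory[i]: per-layer compute time and memory (hypothetical units)
    act_size[i]: size of the activation emitted after layer i
    levels: list of (ratio, error_cost); ratio = fraction of bytes kept (assumption)
    Returns (total_delay, plan) or None if no feasible split exists.
    """
    n = len(compute)
    # suffix[i] = compute time of layers i..n-1. Every remaining layer must
    # still be computed, so this never overestimates the remaining delay.
    suffix = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + compute[i]

    tie = count()  # tie-breaker so the heap never compares plans
    start = (0, 0, 0)
    best = {start: 0.0}
    frontier = [(suffix[0], next(tie), 0.0, start, [])]
    while frontier:
        _, _, g, state, plan = heapq.heappop(frontier)
        layer, stages, err = state
        if layer == n:
            return g, plan
        if g > best.get(state, float("inf")) or stages == max_stages:
            continue
        mem = comp = 0.0
        for end in range(layer + 1, n + 1):
            mem += memory[end - 1]
            comp += compute[end - 1]
            if mem > mem_cap:  # sub-model no longer fits on one satellite
                break
            if end == n:
                # Last stage: nothing is forwarded to another satellite.
                choices = [(None, 0, 0.0)]
            else:
                choices = [(r, e, act_size[end - 1] * r / link_rate)
                           for r, e in levels]
            for ratio, e_cost, tx in choices:
                if err + e_cost > err_budget:
                    continue
                nxt = (end, stages + 1, err + e_cost)
                g2 = g + comp + tx  # edge weight = compute + transmission
                if g2 < best.get(nxt, float("inf")):
                    best[nxt] = g2
                    heapq.heappush(
                        frontier,
                        (g2 + suffix[end], next(tie), g2, nxt,
                         plan + [(layer, end, ratio)]))
    return None


# Toy instance with made-up numbers: 4 layers, at most 2 layers per satellite.
result = split_search(
    compute=[1.0, 1.0, 1.0, 1.0], memory=[1, 1, 1, 1],
    act_size=[4.0, 4.0, 4.0, 4.0], mem_cap=2, link_rate=4.0,
    levels=[(1.0, 0), (0.5, 1)], err_budget=1, max_stages=3)
print(result)
```

In this toy instance the search settles on two stages of two layers each and compresses the single inter-satellite transfer, because a third stage would add a second transfer that the error budget no longer allows to be compressed. Tracking the accumulated error in the search state is one simple way to keep the constraint compatible with a shortest-path search; the paper's actual graph construction and edge weights may differ.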