Feb 18, 2026

arXiv:2602.16760

Privacy-Aware Split Inference with Speculative Decoding for Large Language Models over Wide-Area Networks

AI Summary

This paper introduces a split inference system for LLMs that partitions the transformer model between a local trusted GPU and a remote untrusted GPU, communicating only intermediate activations to preserve privacy. The system employs an asymmetric layer split, keeping embedding/unembedding layers local, and introduces lookahead decoding to amortize WAN latency. Experiments on Mistral 7B and NeMo 12B demonstrate the system's effectiveness, achieving 8.7-9.3 tok/s and 7.8-8.7 tok/s respectively over an 80ms WAN, while also evaluating the privacy-performance tradeoff via inversion attacks.
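The latency-amortization idea in the summary can be illustrated with a toy draft-and-verify loop. This is a simplified n-gram speculation sketch, not the paper's implementation: `next_token` is a hypothetical deterministic stand-in for greedy argmax over a real model's logits, `draft_from_ngrams` and `speculative_decode` are illustrative names, and each verification pass is assumed to cost exactly one WAN round trip.

```python
def next_token(context):
    """Hypothetical stand-in for greedy argmax: depends only on the last two tokens."""
    a, b = context[-2], context[-1]
    return (3 * a + 5 * b + 1) % 7


def sequential_decode(prompt, n_new):
    """Baseline: one token per step, so one simulated round trip per token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out[len(prompt):]


def draft_from_ngrams(tokens, k):
    """Draft up to k tokens by copying what followed the most recent
    earlier occurrence of the current last bigram (empty if none)."""
    key = tuple(tokens[-2:])
    for i in range(len(tokens) - 3, -1, -1):
        if tuple(tokens[i:i + 2]) == key:
            return tokens[i + 2:i + 2 + k]
    return []


def speculative_decode(prompt, n_new, k=4):
    """Draft locally, verify in one pass; keeps the greedy token at every position."""
    out = list(prompt)
    round_trips = 0
    while len(out) - len(prompt) < n_new:
        draft = draft_from_ngrams(out, k)
        round_trips += 1  # assumption: one verification pass = one round trip
        ctx = list(out)
        for guess in draft:
            true_tok = next_token(ctx)  # greedy token given the accepted prefix
            ctx.append(true_tok)
            if true_tok != guess:
                break  # first mismatch: keep the greedy token, drop the rest
        else:
            # Every drafted token matched (or the draft was empty):
            # the same pass also yields one extra greedy token.
            ctx.append(next_token(ctx))
        out = ctx
    return out[len(prompt):len(prompt) + n_new], round_trips


if __name__ == "__main__":
    baseline = sequential_decode([1, 2], 60)
    speculated, trips = speculative_decode([1, 2], 60)
    print("identical output:", speculated == baseline)
    print("round trips:", trips, "vs sequential:", len(baseline))
```

Because only tokens that match the greedy choice are accepted, the output equals sequential decoding by construction, while repeated patterns let a single round trip commit several tokens.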

Key Contribution

Private LLM inference from a trusted local GPU is practical over high-latency networks: split inference with lookahead decoding sustains 8.7-9.3 tok/s on Mistral 7B over an ~80ms WAN link while raw tokens never leave the local device.

Abstract

We present a practical system for privacy-aware large language model (LLM) inference that splits a transformer between a trusted local GPU and an untrusted cloud GPU, communicating only intermediate activations over the network. Our system addresses the unique challenges of autoregressive LLM decoding over high-latency wide-area networks (WANs), contributing: (1) an asymmetric layer split where embedding and unembedding layers remain local, ensuring raw tokens never leave the trusted device; (2) the first application of lookahead decoding to split inference over WANs, amortizing network round-trip latency across multiple tokens per iteration; (3) an empirical inversion attack evaluation showing that split depth provides a tunable privacy-performance tradeoff: an attacker can recover ~59% of tokens at a 2-layer split but only ~35% at an 8-layer split, with minimal throughput impact; (4) ablation experiments showing that n-gram speculation accepts 1.2-1.3 tokens per decoding step on average (peak of 7 observed on code), with acceptance rates consistent across model scales; (5) formal verification that lookahead decoding produces token-identical output to sequential decoding under greedy argmax, with zero quality degradation; and (6) scaling validation on Mistral NeMo 12B (40 layers), demonstrating that the system generalizes to larger models with only 4.9 GB local VRAM and matching 7B throughput. Evaluated on Mistral 7B and NeMo 12B over a ~80ms WAN link, our system achieves 8.7-9.3 tok/s (7B) and 7.8-8.7 tok/s (12B) with lookahead decoding, with an RTT decomposition model (validated at <6.2% cross-validation error) projecting 15-19 tok/s at 20ms RTT.
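As a rough sanity check on the reported numbers, the relation between RTT, tokens accepted per step, and throughput can be sketched with back-of-the-envelope arithmetic. This is not the paper's RTT decomposition model: it assumes one network round trip per decoding step plus a fixed non-network overhead, and the function names and the midpoint inputs (1.25 tokens/step, 9.0 tok/s) are illustrative choices taken from the ranges in the abstract.

```python
def fit_overhead(tokens_per_step, measured_tok_s, rtt_s):
    """Non-network time per step implied by a measured throughput.

    Assumes step_time = rtt + overhead and throughput = tokens_per_step / step_time.
    """
    return tokens_per_step / measured_tok_s - rtt_s


def throughput(tokens_per_step, rtt_s, overhead_s):
    """Projected tokens per second under the same one-round-trip-per-step assumption."""
    return tokens_per_step / (rtt_s + overhead_s)


if __name__ == "__main__":
    # Midpoints of the abstract's ranges for Mistral 7B at ~80 ms RTT (illustrative).
    tokens_per_step = 1.25
    overhead = fit_overhead(tokens_per_step, measured_tok_s=9.0, rtt_s=0.080)
    projected = throughput(tokens_per_step, rtt_s=0.020, overhead_s=overhead)
    print(f"implied non-network overhead per step: {overhead * 1000:.1f} ms")
    print(f"projected throughput at 20 ms RTT: {projected:.1f} tok/s")
```

Under these assumptions the projection at 20ms RTT falls inside the 15-19 tok/s range quoted in the abstract, which suggests the reported figures are mutually consistent with a per-step cost dominated by the round trip.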

Architecture Design (Transformers, SSMs, MoE); Distributed Systems & Hardware; Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
