This paper introduces a split inference system for LLMs that partitions the transformer model between a local trusted GPU and a remote untrusted GPU, communicating only intermediate activations to preserve privacy. The system employs an asymmetric layer split, keeping embedding/unembedding layers local, and introduces lookahead decoding to amortize WAN latency. Experiments on Mistral 7B and NeMo 12B demonstrate the system's effectiveness, achieving 8.7-9.3 tok/s and 7.8-8.7 tok/s respectively over an 80ms WAN, while also evaluating the privacy-performance tradeoff via inversion attacks.
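As a rough illustration of how verifying several tokens per round trip can amortize WAN latency, here is a minimal sketch. It is not the paper's lookahead decoding implementation: it swaps in a simpler n-gram lookup draft (propose tokens that followed the same 2-gram earlier in the output) with greedy verification. `toy_model`, `propose_draft`, and `speculative_decode` are hypothetical names, and the toy model is a deterministic stand-in for argmax over real logits.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # greedy next-token function (stand-in for argmax)


def toy_model(prefix: List[int]) -> int:
    # Hypothetical deterministic "LLM": next token depends on the last two tokens.
    a, b = prefix[-2], prefix[-1]
    return (a + 2 * b + 1) % 5


def sequential_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Baseline: one token per step, i.e. one network round trip per token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def propose_draft(history: List[int], ngram: int = 2, draft_len: int = 4) -> List[int]:
    # Find the most recent earlier occurrence of the trailing n-gram and
    # propose the tokens that followed it.
    if len(history) <= ngram:
        return []
    key = history[-ngram:]
    for start in range(len(history) - ngram - 1, -1, -1):
        if history[start:start + ngram] == key:
            return history[start + ngram:start + ngram + draft_len]
    return []


def speculative_decode(model: Model, prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    out = list(prompt)
    target = len(prompt) + n_new
    round_trips = 0
    while len(out) < target:
        draft = propose_draft(out)
        round_trips += 1  # assumption: one round trip verifies the whole draft
        accepted: List[int] = []
        # A real system would score all draft positions in one batched pass;
        # the toy calls the model once per position instead.
        for guess in draft:
            true_tok = model(out + accepted)
            accepted.append(true_tok)  # always keep the model's own token
            if guess != true_tok:
                break  # first mismatch ends the accepted run
        else:
            accepted.append(model(out + accepted))  # bonus token after a full match
        out.extend(accepted[:target - len(out)])
    return out, round_trips


if __name__ == "__main__":
    prompt = [1, 2]
    baseline = sequential_decode(toy_model, prompt, 40)
    fast, trips = speculative_decode(toy_model, prompt, 40)
    print(fast == baseline, trips)
```

Because every kept token is the model's own greedy choice, the output matches sequential greedy decoding by construction; only the number of round trips changes. The toy sequence is periodic, so drafts succeed far more often here than the acceptance rates reported for real text.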
Running LLMs privately on your laptop without sacrificing speed is now practical: split inference and lookahead decoding can deliver near-native throughput even over high-latency networks.
We present a practical system for privacy-aware large language model (LLM) inference that splits a transformer between a trusted local GPU and an untrusted cloud GPU, communicating only intermediate activations over the network. Our system addresses the unique challenges of autoregressive LLM decoding over high-latency wide-area networks (WANs), contributing: (1) an asymmetric layer split where embedding and unembedding layers remain local, ensuring raw tokens never leave the trusted device; (2) the first application of lookahead decoding to split inference over WANs, amortizing network round-trip latency across multiple tokens per iteration; (3) an empirical inversion attack evaluation showing that split depth provides a tunable privacy-performance tradeoff -- an attacker can recover ~59% of tokens at a 2-layer split but only ~35% at an 8-layer split, with minimal throughput impact; (4) ablation experiments showing that n-gram speculation accepts 1.2-1.3 tokens per decoding step on average (peak of 7 observed on code), with acceptance rates consistent across model scales; (5) formal verification that lookahead decoding produces token-identical output to sequential decoding under greedy argmax, with zero quality degradation; and (6) scaling validation on Mistral NeMo 12B (40 layers), demonstrating that the system generalizes to larger models with only 4.9 GB local VRAM and matching 7B throughput. Evaluated on Mistral 7B and NeMo 12B over a ~80ms WAN link, our system achieves 8.7-9.3 tok/s (7B) and 7.8-8.7 tok/s (12B) with lookahead decoding, with an RTT decomposition model (validated at <6.2% cross-validation error) projecting 15-19 tok/s at 20ms RTT.
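The relationship between round-trip time and throughput quoted above can be sketched with back-of-envelope arithmetic. This is not the paper's validated RTT decomposition model, whose terms are not given here. It assumes a two-term step time (per-step compute plus one network round trip) and uses midpoints of the reported ranges as inputs; the function names are hypothetical.

```python
def throughput_tok_s(tokens_per_step: float, compute_s: float, rtt_s: float) -> float:
    # Assumption: each decoding step costs fixed compute time plus one round trip.
    return tokens_per_step / (compute_s + rtt_s)


def implied_compute_s(tok_per_s: float, tokens_per_step: float, rtt_s: float) -> float:
    # Invert the same model to back out per-step compute from an observed rate.
    return tokens_per_step / tok_per_s - rtt_s


if __name__ == "__main__":
    tokens_per_step = 1.25  # midpoint of the reported 1.2-1.3 accepted tokens per step
    observed_tok_s = 9.0    # midpoint of the reported 8.7-9.3 tok/s (7B, ~80ms WAN)

    compute = implied_compute_s(observed_tok_s, tokens_per_step, rtt_s=0.080)
    projected = throughput_tok_s(tokens_per_step, compute, rtt_s=0.020)
    print(f"implied compute per step: {compute * 1000:.1f} ms")
    print(f"projected throughput at 20 ms RTT: {projected:.1f} tok/s")
```

Under these assumptions the implied compute time is roughly 59 ms per step and the 20ms projection comes out near 15.8 tok/s, toward the lower end of the 15-19 tok/s range quoted above. The paper's own decomposition may include terms this sketch omits, such as serialization or transfer time for activations.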