Mila | Mar 16, 2026 | arXiv:2603.15954

MobileLLM-Flash: Latency-Guided On-Device LLM Design for Industry Scale

Igor Fedorov, Andrey Gromov, B. Beckerman, Naveen Suda, David Eriksson, M. Balandat, Rylan Conway, Patrick Huber, Chinnadhurai Sankar, Ayushi Dalmia, Zechun Liu, Tarek Elgamal, Adithya Sagar, Vikas Chandra, Raghuraman Krishnamoorthi

AI Summary

The authors introduce a hardware-in-the-loop architecture search methodology for designing on-device LLMs that meet mobile latency constraints and run on standard mobile runtimes. They avoid custom kernels and specialized attention mechanisms, instead using attention skipping and jointly optimizing model architecture and attention pattern. By treating each candidate as a pruned version of a pretrained backbone with inherited weights, they reach high accuracy with minimal continued pretraining. The result is the MobileLLM-Flash family of models (350M, 650M, 1.4B), which delivers up to 1.8x faster prefill and up to 1.6x faster decode on mobile CPUs.

Key Contribution

Rather than relying on specialized attention mechanisms, MobileLLM-Flash achieves up to 1.8x faster prefill on mobile CPUs by pruning a pretrained backbone under mobile latency constraints and using attention skipping for long contexts.

Abstract

Real-time AI experiences call for on-device large language models (OD-LLMs) optimized for efficient deployment on resource-constrained hardware. The most useful OD-LLMs produce near-real-time responses and exhibit broad hardware compatibility, maximizing user reach. We present a methodology for designing such models using hardware-in-the-loop architecture search under mobile latency constraints. This system is amenable to industry-scale deployment: it generates models deployable without custom kernels and compatible with standard mobile runtimes like ExecuTorch. Our methodology avoids specialized attention mechanisms and instead uses attention skipping for long-context acceleration. Our approach jointly optimizes model architecture (layers, dimensions) and attention pattern. To efficiently evaluate candidates, we treat each as a pruned version of a pretrained backbone with inherited weights, thereby achieving high accuracy with minimal continued pretraining. We leverage the low cost of latency evaluation in a staged process: learning an accurate latency model first, then searching for the Pareto frontier across latency and quality. This yields MobileLLM-Flash, a family of foundation models (350M, 650M, 1.4B) for efficient on-device use with strong capabilities, supporting up to 8k context length. MobileLLM-Flash delivers up to 1.8x faster prefill and up to 1.6x faster decode on mobile CPUs with comparable or superior quality. Our analysis of Pareto-frontier design choices offers actionable principles for OD-LLM design.
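To make the last stage of the staged process concrete, the following is a minimal sketch of extracting a Pareto frontier over latency and quality from a set of scored candidates. It is an illustration only, not the authors' search procedure: the `Candidate` type, the `pareto_frontier` function, and all numbers are hypothetical, and the latency values stand in for whatever a learned latency model or on-device measurement would return.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical record for one architecture candidate."""
    name: str
    latency_ms: float  # predicted or measured latency (lower is better)
    quality: float     # any quality score (higher is better)


def pareto_frontier(candidates):
    """Return the candidates that no other candidate dominates.

    A candidate is dominated if another one is at least as fast and at
    least as good, and strictly better on one of the two axes.
    """
    # Sort by latency ascending; among equal latencies, best quality first.
    ordered = sorted(candidates, key=lambda c: (c.latency_ms, -c.quality))
    frontier = []
    best_quality = float("-inf")
    for c in ordered:
        # Keep a candidate only if it beats every faster candidate on quality.
        if c.quality > best_quality:
            frontier.append(c)
            best_quality = c.quality
    return frontier


# Made-up numbers for illustration; they are not results from the paper.
candidates = [
    Candidate("a", 100.0, 0.50),
    Candidate("b", 120.0, 0.48),  # slower and worse than "a"
    Candidate("c", 150.0, 0.60),
    Candidate("d", 150.0, 0.55),  # same latency as "c" but worse
    Candidate("e", 200.0, 0.62),
]
frontier_names = [c.name for c in pareto_frontier(candidates)]
```

Here `frontier_names` contains only the candidates that trade latency for quality without being beaten on both axes. Because the sweep needs only a sort and one pass, it stays cheap even when the latency model scores a large number of candidates.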

Architecture Design (Transformers, SSMs, MoE) | Distributed Systems & Hardware | Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 37

Year: 2026

Venue: N/A
