This survey paper examines lightweight transformer architectures optimized for deployment on resource-constrained edge devices, focusing on model compression, quantization, pruning, and knowledge distillation techniques. It benchmarks prominent lightweight variants such as MobileBERT, TinyBERT, and EfficientFormer on datasets including GLUE and ImageNet-1K, evaluating their performance across different hardware platforms and deployment frameworks. The analysis identifies sparse attention, mixed-precision quantization, and hardware-aware NAS as effective optimization strategies, demonstrating that lightweight transformers can achieve significant reductions in model size and inference latency with minimal accuracy loss.
Achieve near-full accuracy (75-96%) with lightweight transformers on edge devices, slashing model size by 4-10x and inference latency by 3-9x.
The deployment of transformer-based models on resource-constrained edge devices represents a critical challenge in enabling real-time artificial intelligence applications. This comprehensive survey examines lightweight transformer architectures specifically designed for edge deployment, analyzing recent advances in model compression, quantization, pruning, and knowledge distillation techniques. We systematically review prominent lightweight variants including MobileBERT, TinyBERT, DistilBERT, EfficientFormer, EdgeFormer, and MobileViT, providing detailed performance benchmarks on standard datasets such as GLUE, SQuAD, ImageNet-1K, and COCO. Our analysis encompasses current industry adoption patterns across major hardware platforms (NVIDIA Jetson, Qualcomm Snapdragon, Apple Neural Engine, ARM architectures), deployment frameworks (TensorFlow Lite, ONNX Runtime, PyTorch Mobile, CoreML), and optimization strategies. Experimental results demonstrate that modern lightweight transformers can achieve 75-96% of full-model accuracy while reducing model size by 4-10x and inference latency by 3-9x, enabling deployment on devices with as little as 2-5W power consumption. We identify sparse attention mechanisms, mixed-precision quantization (INT8/FP16), and hardware-aware neural architecture search as the most effective optimization strategies. Novel findings include memory-bandwidth bottleneck analysis revealing 15-40M parameter models achieve optimal hardware utilization (60-75% efficiency), quantization sweet spots for different model types, and comprehensive energy efficiency profiling across edge platforms. We establish real-time performance boundaries and provide a practical 6-step deployment pipeline achieving 8-12x size reduction with less than 2% accuracy degradation.
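The abstract singles out mixed-precision quantization (INT8/FP16) as one of the most effective optimization strategies. As a minimal, dependency-free sketch of the arithmetic that INT8 schemes build on, the following implements symmetric per-tensor quantization: floats are mapped to integers in [-127, 127] via a single scale factor. The function names are illustrative, not from the paper; production deployments would use framework tooling such as TensorFlow Lite or ONNX Runtime instead.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: map floats to [-127, 127].

    The scale is chosen so the largest-magnitude weight lands on +/-127;
    a degenerate all-zero tensor falls back to scale 1.0.
    """
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights; error is bounded by scale / 2."""
    return [qi * scale for qi in q]
```

Storing each weight as one signed byte instead of a 32-bit float is where the roughly 4x size reduction cited for INT8 comes from; the accuracy cost is the rounding error bounded by half the scale.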
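Knowledge distillation, used to train compact variants such as TinyBERT and DistilBERT, is another technique the survey covers. A minimal sketch of the standard Hinton-style objective, assuming the usual formulation: a blend of hard-label cross-entropy with a temperature-softened KL term against the teacher (the hyperparameter values and function names here are illustrative, not taken from the paper).

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution."""
    exps = [math.exp(l / T) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.5):
    """Hinton-style KD loss for a single example.

    Blends hard-label cross-entropy on the student's predictions with the
    KL divergence between temperature-softened teacher and student outputs.
    """
    # Hard-label term: ordinary cross-entropy at temperature 1.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])
    # Soft-target term: the T^2 factor rescales its gradient magnitude
    # to stay comparable with the hard-label term.
    ps = softmax(student_logits, T)
    pt = softmax(teacher_logits, T)
    kl = sum(t * math.log(t / s) for t, s in zip(pt, ps))
    return alpha * ce + (1.0 - alpha) * T * T * kl
```

When the student matches the teacher exactly, the KL term vanishes and the loss reduces to the weighted cross-entropy alone; the soft targets otherwise carry the teacher's inter-class similarity structure, which is what lets a 4-10x smaller student retain most of the full model's accuracy.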