This paper presents what the authors describe as the first end-to-end Retrieval-Augmented Generation (RAG) pipeline to run every neural stage (embedding, reranking, and LLM generation) on the Qualcomm Hexagon NPU of the Snapdragon X Elite, targeting the high energy cost of CPU inference. Against a CPU baseline, the NPU achieves 9.1x higher embedding throughput and 12.3x lower system energy during indexing, and on queries it delivers 18.1x faster LLM prefilling, 4.0x lower end-to-end latency, and 4.0x lower system energy. A GPT-4.1 LLM-as-judge evaluation rates NPU answer quality on par with the CPU and GPU backends, indicating a viable path to energy-efficient on-device RAG.
Achieving 18.1x faster LLM prefilling and 4.0x lower query energy on-device could reshape what is practical for on-device AI applications.
Retrieval-Augmented Generation (RAG) pipelines are compute-intensive, combining embedding, retrieval, reranking, and large language model (LLM) generation. Running them entirely on-device benefits privacy, latency, and offline use, but the energy cost of CPU inference is a major barrier. We present what is, to our knowledge, the first end-to-end RAG pipeline that runs all neural stages -- embedding, reranking, and LLM generation -- on the Qualcomm Hexagon NPU of the Snapdragon X Elite. Profiling on a Dell XPS 13 laptop, we compare NPU-accelerated RAG against CPU and OpenCL/Adreno GPU baselines on indexing and query workloads. On indexing, the NPU achieves 9.1x higher embedding throughput and 12.3x less system energy. On a 120-query Wikipedia-passage benchmark, it delivers 18.1x faster LLM prefilling, 4.0x lower end-to-end query latency, and 4.0x less system energy than the CPU baseline; the same workload on the integrated GPU is 1.7x slower than CPU and uses 6.5x more energy than the NPU. A GPT-4.1 LLM-as-judge evaluation finds NPU answer quality on par with CPU and GPU within evaluator noise (mean 9.32 vs. 8.95 vs. 9.03 on a 1-10 rubric), with 86.7% of queries scoring identically across all three backends. On the Snapdragon X Elite / Hexagon class of laptop SoC, the NPU thus enables practical, energy-efficient on-device RAG without quality regression -- a sustainable path toward green edge intelligence that we expect to generalize to comparable mobile NPUs (Apple Neural Engine, Intel NPU, MediaTek APU) as their software stacks mature.
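To make the four stages named in the abstract concrete (embedding, retrieval, reranking, generation), here is a minimal, CPU-only sketch of how such a pipeline fits together. It is not the authors' implementation and does not touch the NPU: the bag-of-words embedder, the token-overlap reranker, and the `generate` stub are hypothetical stand-ins for the neural models the paper offloads to the Hexagon NPU, and all function names are illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy stand-in for a neural embedder: L2-normalized bag-of-words vector."""
    counts = Counter(tokenize(text))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def build_index(passages):
    """Indexing stage: embed every passage once and keep (vector, text) pairs."""
    return [(embed(p), p) for p in passages]


def retrieve(query_vec, index, k):
    """Retrieval stage: top-k passages by cosine similarity (vectors are unit length)."""
    scored = []
    for i, (vec, _) in enumerate(index):
        score = sum(w * vec.get(tok, 0.0) for tok, w in query_vec.items())
        scored.append((score, i))
    scored.sort(reverse=True)
    return [index[i][1] for _, i in scored[:k]]


def rerank(query, passages, k):
    """Toy stand-in for a neural reranker: order candidates by shared unique tokens."""
    q_tokens = set(tokenize(query))
    return sorted(
        passages,
        key=lambda p: len(q_tokens & set(tokenize(p))),
        reverse=True,
    )[:k]


def generate(query, context):
    """Stub for LLM generation: returns the prompt a real model would be given."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}"


def answer_query(query, index, k_retrieve=2, k_rerank=1):
    """Query path: embed -> retrieve -> rerank -> generate."""
    candidates = retrieve(embed(query), index, k_retrieve)
    context = rerank(query, candidates, k_rerank)
    return generate(query, context)


if __name__ == "__main__":
    corpus = [
        "The Hexagon NPU is part of the Snapdragon X Elite.",
        "Paris is the capital of France.",
        "Python is a programming language.",
    ]
    index = build_index(corpus)
    print(answer_query("Which NPU is in the Snapdragon X Elite?", index))
```

In a pipeline like the one the abstract describes, `embed`, `rerank`, and `generate` would each call a model running on an accelerator, while the similarity search in `retrieve` is the only stage that is not a neural model. Splitting `build_index` from `answer_query` mirrors the paper's separate indexing and query workloads.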