The paper introduces HeRo, a framework for deploying agentic retrieval-augmented generation (RAG) workflows on heterogeneous mobile System-on-Chips (SoCs). HeRo builds profiling-based performance models that capture latency, workload shape, and contention-induced slowdown for each sub-stage and each pairing of model and processing unit (PU). A lightweight online scheduler uses these models to combine shape-aware sub-stage partitioning, criticality-based accelerator mapping, and bandwidth-aware concurrency control. On commercial mobile devices, HeRo reduces end-to-end RAG latency by up to 10.94× over existing deployment strategies.
Achieve up to a 10.94× reduction in end-to-end latency for on-device agentic RAG by scheduling sub-stages across heterogeneous mobile SoC hardware according to profiled performance.
With the increasing computational capability of mobile devices, deploying agentic retrieval-augmented generation (RAG) locally on heterogeneous System-on-Chips (SoCs) has become a promising way to enhance LLM-based applications. However, agentic RAG induces multi-stage workflows with heterogeneous models and dynamic execution flow, while mobile SoCs exhibit strong accelerator affinity, shape sensitivity, and shared-memory bandwidth contention, making naive scheduling ineffective. We present HeRo, a heterogeneous-aware framework for low-latency agentic RAG on mobile SoCs. HeRo builds profiling-based performance models for each sub-stage and model-PU configuration, capturing latency, workload shape, and contention-induced slowdown, and leverages them in a lightweight online scheduler that combines shape-aware sub-stage partitioning, criticality-based accelerator mapping, and bandwidth-aware concurrency control. Experiments on commercial mobile devices show that HeRo reduces end-to-end latency by up to $10.94\times$ over existing deployment strategies, enabling practical on-device agentic RAG.
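To make the scheduling idea concrete, the following is a minimal sketch of how criticality-based mapping and bandwidth-aware concurrency control could interact. It is not the authors' algorithm. The stage names, the latency table `PROFILE`, the `criticality` and `bandwidth` fields, and the `bw_budget` threshold are all hypothetical values chosen for illustration, and shape-aware partitioning is omitted entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubStage:
    name: str
    criticality: float  # assumed: higher means closer to the critical path
    bandwidth: float    # assumed: share of memory bandwidth consumed (0..1)


# Hypothetical profiled latencies in milliseconds, keyed by (sub-stage, PU).
PROFILE = {
    ("embed", "NPU"): 12.0, ("embed", "GPU"): 20.0, ("embed", "CPU"): 45.0,
    ("rerank", "NPU"): 30.0, ("rerank", "GPU"): 25.0, ("rerank", "CPU"): 80.0,
    ("decode", "NPU"): 60.0, ("decode", "GPU"): 40.0, ("decode", "CPU"): 150.0,
}


def schedule(stages, pus=("NPU", "GPU", "CPU"), bw_budget=1.0):
    """Greedy sketch: the most critical stage picks its fastest free PU first.

    A stage is deferred when no PU is free or when running it concurrently
    would push total bandwidth demand past bw_budget (an assumed threshold).
    """
    free = list(pus)
    used_bw = 0.0
    plan, deferred = {}, []
    for stage in sorted(stages, key=lambda s: s.criticality, reverse=True):
        if not free or used_bw + stage.bandwidth > bw_budget:
            deferred.append(stage.name)
            continue
        # Pick the free PU with the lowest profiled latency for this stage.
        pu = min(free, key=lambda p: PROFILE[(stage.name, p)])
        free.remove(pu)
        used_bw += stage.bandwidth
        plan[stage.name] = pu
    return plan, deferred


stages = [
    SubStage("embed", criticality=2.0, bandwidth=0.3),
    SubStage("rerank", criticality=1.0, bandwidth=0.3),
    SubStage("decode", criticality=3.0, bandwidth=0.6),
]
plan, deferred = schedule(stages)
print(plan, deferred)
```

In this toy setup the most critical stage claims the PU on which it runs fastest, and the least critical stage is deferred because admitting it would exceed the assumed bandwidth budget. A real scheduler would also need to re-evaluate deferred stages as PUs free up and to account for the dynamic execution flow the abstract describes.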