This paper introduces CAGR, a cross-accelerator graph optimization framework designed to improve the inference performance of deep learning recommendation models (DLRMs) across diverse hardware platforms. CAGR uses a hardware-aware graph rewriting engine guided by reinforcement learning to dynamically select optimal operator implementations based on compute-to-memory bandwidth ratios and operator density. Experiments on the Avazu dataset demonstrate that CAGR achieves 1.8-3.2x speedup compared to baseline implementations on NVIDIA V100, AMD MI100, and Google TPU v3, while also reducing optimization time.
Get near-peak performance for your recommender system across GPUs and TPUs without tedious platform-specific tuning, thanks to a new cross-accelerator graph optimization framework.
Recommender systems have become ubiquitous in modern online services, yet their deployment across diverse hardware accelerators remains challenging due to significant performance variations. Contemporary deep learning recommendation models (DLRMs), such as DeepFM and NGCF, exhibit substantial inference latency differences when executed on NVIDIA GPUs, AMD GPUs, and Google TPUs, primarily due to architectural disparities and vendor-specific optimization strategies. Existing graph optimization frameworks are typically designed for specific hardware backends, lacking the flexibility to generate portable high-performance implementations across heterogeneous accelerators. This paper presents CAGR (Cross-Accelerator Graph Rewriting), a novel framework that achieves performance-portable inference optimization for recommendation models through three key innovations: 1) a hardware-aware graph rewriting engine that dynamically selects optimal operator implementations by analyzing compute-to-memory bandwidth ratios and operator density characteristics; 2) a reinforcement learning-based transformation policy that learns cross-platform optimization strategies without exhaustive search; and 3) a heterogeneous pipeline architecture enabling "optimize once, deploy across supported backends" semantics. We implement CAGR with support for multiple kernel backends, including Triton, cuBLAS, MIOpen, and oneDNN, and demonstrate its effectiveness on the Avazu CTR prediction dataset. Experimental results show that CAGR achieves 1.8–3.2× speedup over baseline implementations across NVIDIA V100, AMD MI100, and Google TPU v3 platforms, while reducing optimization time by 67% compared to platform-specific auto-tuning approaches. Furthermore, CAGR maintains 92–96% of reference optimized performance with zero manual intervention, demonstrating practical viability for production recommendation systems requiring multi-vendor deployment.
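The abstract's first innovation, selecting operator implementations from compute-to-memory bandwidth ratios, can be illustrated with a roofline-style heuristic: compare an operator's arithmetic intensity (FLOPs per byte moved) against the accelerator's machine balance (peak FLOP/s divided by memory bandwidth) and pick a compute-bound or memory-bound kernel variant accordingly. The sketch below is illustrative only; the class and function names are hypothetical and not CAGR's actual API.

```python
# Hypothetical roofline-style selection heuristic; names are illustrative,
# not taken from the CAGR framework itself.
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    peak_flops: float      # peak compute, FLOP/s
    mem_bandwidth: float   # peak memory bandwidth, bytes/s

    @property
    def machine_balance(self) -> float:
        # FLOPs per byte the hardware can sustain at peak
        return self.peak_flops / self.mem_bandwidth


@dataclass
class OpProfile:
    name: str
    flops: float        # total floating-point operations
    bytes_moved: float  # total bytes read + written

    @property
    def arithmetic_intensity(self) -> float:
        return self.flops / self.bytes_moved


def select_impl(op: OpProfile, hw: Accelerator) -> str:
    """Pick a kernel variant by comparing op intensity to machine balance."""
    if op.arithmetic_intensity >= hw.machine_balance:
        return "compute_optimized"   # e.g. an MMA/tensor-core-heavy kernel
    return "memory_optimized"        # e.g. a fused, bandwidth-saving kernel


# V100: ~15.7 TFLOP/s FP32 peak, ~900 GB/s HBM2 bandwidth
v100 = Accelerator("V100", peak_flops=15.7e12, mem_bandwidth=900e9)

# DLRM-style ops: embedding lookups are bandwidth-bound, dense matmuls are not
embedding_lookup = OpProfile("embedding_lookup", flops=1e6, bytes_moved=4e6)
dense_matmul = OpProfile("matmul", flops=2e11, bytes_moved=6e8)

print(select_impl(embedding_lookup, v100))  # prints "memory_optimized"
print(select_impl(dense_matmul, v100))      # prints "compute_optimized"
```

This split matters for DLRMs in particular, since their graphs mix very low-intensity embedding operations with high-intensity dense layers, so no single kernel choice is optimal across the whole graph or across accelerators with different machine balances.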