This paper introduces CAGR, a cross-accelerator graph optimization framework designed to improve the inference performance of deep learning recommendation models (DLRMs) across diverse hardware platforms. CAGR uses a hardware-aware graph rewriting engine guided by reinforcement learning to dynamically select optimal operator implementations based on compute-to-memory bandwidth ratios and operator density. Experiments on the Avazu dataset demonstrate that CAGR achieves 1.8-3.2x speedup compared to baseline implementations on NVIDIA V100, AMD MI100, and Google TPU v3, while also reducing optimization time.
Get near-peak performance for your recommender system across GPUs and TPUs without tedious platform-specific tuning, thanks to a new cross-accelerator graph optimization framework.
Recommender systems have become ubiquitous in modern online services, yet their deployment across diverse hardware accelerators remains challenging due to significant performance variations. Contemporary deep learning recommendation models (DLRMs), such as DeepFM and NGCF, exhibit substantial inference latency differences when executed on NVIDIA GPUs, AMD GPUs, and Google TPUs, primarily due to architectural disparities and vendor-specific optimization strategies. Existing graph optimization frameworks are typically designed for specific hardware backends, lacking the flexibility to generate portable high-performance implementations across heterogeneous accelerators. This paper presents CAGR (Cross-Accelerator Graph Rewriting), a novel framework that achieves performance-portable inference optimization for recommendation models through three key innovations: 1) a hardware-aware graph rewriting engine that dynamically selects optimal operator implementations by analyzing compute-to-memory bandwidth ratios and operator density characteristics; 2) a reinforcement learning-based transformation policy that learns cross-platform optimization strategies without exhaustive search; and 3) a heterogeneous pipeline architecture enabling "optimize once, deploy across supported backends" semantics. We implement CAGR with support for multiple kernel backends, including Triton, cuBLAS, MIOpen, and oneDNN, and demonstrate its effectiveness on the Avazu CTR prediction dataset. Experimental results show that CAGR achieves 1.8–3.2× speedup over baseline implementations across NVIDIA V100, AMD MI100, and Google TPU v3 platforms, while reducing optimization time by 67% compared to platform-specific auto-tuning approaches. Furthermore, CAGR maintains 92–96% of reference optimized performance with zero manual intervention, demonstrating practical viability for production recommendation systems requiring multi-vendor deployment.
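The abstract's first innovation, selecting operator implementations from compute-to-memory bandwidth ratios, can be illustrated with a roofline-style heuristic: compare an operator's arithmetic intensity (FLOPs per byte moved) against the accelerator's machine balance (peak FLOP/s divided by memory bandwidth) and pick a compute-bound or memory-bound kernel variant accordingly. The sketch below is illustrative only; the class and function names are hypothetical and not CAGR's actual API.

```python
# Hypothetical roofline-style selection heuristic; names are illustrative,
# not taken from the CAGR framework itself.
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    peak_flops: float      # peak compute, FLOP/s
    mem_bandwidth: float   # peak memory bandwidth, bytes/s

    @property
    def machine_balance(self) -> float:
        # FLOPs per byte the hardware can sustain at peak
        return self.peak_flops / self.mem_bandwidth


@dataclass
class OpProfile:
    name: str
    flops: float        # total floating-point operations
    bytes_moved: float  # total bytes read + written

    @property
    def arithmetic_intensity(self) -> float:
        return self.flops / self.bytes_moved


def select_impl(op: OpProfile, hw: Accelerator) -> str:
    """Pick a kernel variant by comparing op intensity to machine balance."""
    if op.arithmetic_intensity >= hw.machine_balance:
        return "compute_optimized"   # e.g. an MMA/tensor-core-heavy kernel
    return "memory_optimized"        # e.g. a fused, bandwidth-saving kernel


# V100: ~15.7 TFLOP/s FP32 peak, ~900 GB/s HBM2 bandwidth
v100 = Accelerator("V100", peak_flops=15.7e12, mem_bandwidth=900e9)

# DLRM-style ops: embedding lookups are bandwidth-bound, dense matmuls are not
embedding_lookup = OpProfile("embedding_lookup", flops=1e6, bytes_moved=4e6)
dense_matmul = OpProfile("matmul", flops=2e11, bytes_moved=6e8)

print(select_impl(embedding_lookup, v100))  # prints "memory_optimized"
print(select_impl(dense_matmul, v100))      # prints "compute_optimized"
```

This split matters for DLRMs in particular, since their graphs mix very low-intensity embedding operations with high-intensity dense layers, so no single kernel choice is optimal across the whole graph or across accelerators with different machine balances.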