This paper investigates the feasibility of using SmartNIC DPUs to offload asynchronous "fire-and-forget" communication tasks from the host CPU. The authors design and implement "Buddy," a communication offloading engine that runs on Nvidia BlueField-3 DPUs and x86 CPUs. Results across five applications show that offloading communication to the DPU yields up to 1.55x speedup for host-dominated workloads, but they also reveal a 625x increase in DRAM traffic caused by the DPU's lack of Direct Cache Access, indicating a critical design bottleneck.
Offloading communication to SmartNIC DPUs can speed up host-dominated workloads by 1.55x, but the lack of Direct Cache Access creates a massive DRAM bottleneck.
SmartNIC Data Processing Units (DPUs) offer a promising solution for saving high-end CPU resources by offloading tasks to programmable cores near the network interface. In this work, we explore the feasibility of SmartNIC DPUs in supporting an asynchronous communication model called "fire-and-forget", particularly its core message routing service. We design a communication offloading engine called Buddy that decouples communication tasks from the application process. Buddy runs flexibly on SmartNIC DPUs such as the Nvidia BlueField-3 DPU and generic x86 CPUs. Our evaluation across five applications identifies the memory-to-communication ratio as a key predictor of offloading performance. Host-dominated workloads, such as Quicksilver and Sparse Matrix Transpose, achieved up to 1.55x speedup with communication offloaded to the DPU. We further identify a 625x increase in DRAM traffic due to the absence of Direct Cache Access support on the DPU, highlighting a critical need in future SmartNIC designs.
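To make the "fire-and-forget" idea concrete, here is a minimal single-process sketch of an application handing messages to a separate routing worker and continuing without waiting for delivery. It is an illustration only and does not reproduce Buddy's design or API: the `OffloadEngine` class, its `send`, `drain`, and `shutdown` methods, and the in-memory mailboxes are all hypothetical, and a Python thread stands in for the DPU or spare x86 core.

```python
import queue
import threading
from collections import defaultdict


class OffloadEngine:
    """Hypothetical stand-in for a communication offloading engine.

    A background thread plays the role of the offload target; it is
    not a model of any real DPU interface.
    """

    def __init__(self):
        self._outbox = queue.Queue()
        # Destination id -> list of delivered payloads (assumed in-memory).
        self.mailboxes = defaultdict(list)
        self._worker = threading.Thread(target=self._route, daemon=True)
        self._worker.start()

    def send(self, dest, payload):
        """Fire-and-forget: enqueue the message and return immediately."""
        self._outbox.put((dest, payload))

    def _route(self):
        """Routing loop, running apart from the application thread."""
        while True:
            item = self._outbox.get()
            if item is None:  # sentinel: stop the worker
                self._outbox.task_done()
                break
            dest, payload = item
            self.mailboxes[dest].append(payload)
            self._outbox.task_done()

    def drain(self):
        """Block until every message enqueued so far has been routed."""
        self._outbox.join()

    def shutdown(self):
        self._outbox.put(None)
        self._worker.join()


def run_demo():
    engine = OffloadEngine()
    for i in range(6):
        # The application never waits on an individual message.
        engine.send(dest=i % 3, payload=i)
    engine.drain()      # a single synchronization point at the end
    engine.shutdown()
    return dict(engine.mailboxes)


if __name__ == "__main__":
    print(run_demo())
```

The point of the sketch is the decoupling: `send` costs the application only an enqueue, while routing happens elsewhere. On real hardware the queue would cross the host-to-DPU boundary, which is where the memory traffic discussed in the abstract comes into play.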