UCSC · Mar 12, 2026 · arXiv:2603.11438

NCCLbpf: Verified, Composable Policy Execution for GPU Collective Communication

AI Summary

NCCLbpf introduces a verified extension framework for NCCL, the de facto standard for GPU collective communication, by embedding a userspace eBPF runtime into NCCL's plugin interfaces. This approach enables load-time static verification to prevent unsafe plugin execution, composable policies via structured cross-plugin maps, and atomic policy hot-reloads. Evaluations on NVIDIA B300 GPUs show minimal overhead (80-130 ns per tuner decision), and a message-size-aware eBPF policy improves AllReduce throughput by up to 27% in the 4-128 MiB range.
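The idea behind an atomic hot-reload can be illustrated with a minimal C11 sketch: the active policy sits behind a single atomic pointer, so each decision sees either the old policy or the new one, never a mix. This is not NCCLbpf's actual mechanism; `policy_fn`, `policy_v1`, `policy_v2`, `decide`, and `policy_reload` are hypothetical names, and the policies return opaque placeholder values.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical policy signature: maps a message size to an opaque choice id. */
typedef int (*policy_fn)(size_t nbytes);

/* Two placeholder policies (assumptions, not taken from the paper). */
static int policy_v1(size_t nbytes) { (void)nbytes; return 0; }
static int policy_v2(size_t nbytes) { return nbytes >= ((size_t)4 << 20) ? 1 : 0; }

/* The active policy; callers and the reloader share only this one pointer. */
static _Atomic(policy_fn) active_policy = policy_v1;

/* Called on every decision: one atomic load, then an indirect call. */
static int decide(size_t nbytes) {
    policy_fn p = atomic_load_explicit(&active_policy, memory_order_acquire);
    return p(nbytes);
}

/* Swap in a new policy in one step; returns the previous policy so the
 * caller can retire it once no decision is still running it. */
static policy_fn policy_reload(policy_fn next) {
    assert(next != NULL);
    return atomic_exchange_explicit(&active_policy, next, memory_order_acq_rel);
}
```

Calling `policy_reload(policy_v2)` while other threads call `decide` requires no pause or restart; reclaiming the old policy's resources safely is a separate problem that this sketch leaves out.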

Key Contribution

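Embedding a userspace eBPF runtime in NCCL's existing plugin interfaces makes plugins verifiable at load time and hot-reloadable, and enables a message-size-aware policy that improves AllReduce throughput by up to 27%, all without modifying NCCL itself.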

Abstract

NCCL is the de facto standard for collective GPU communication in large-scale distributed training, relying heavily on plugins to customize runtime behavior. However, these plugins execute as unverified native code within NCCL's address space, risking job crashes, silent state corruption, and downtime from restarts during policy updates. Inspired by kernel extensibility models, we introduce NCCLbpf, a verified, high-performance extension framework embedding a userspace eBPF runtime directly into NCCL's existing plugin interfaces, without modifying NCCL itself. NCCLbpf offers load-time static verification to prevent unsafe plugin execution, structured cross-plugin maps enabling composable policies and closed-loop adaptation, and atomic policy hot-reloads eliminating downtime previously required for policy updates. Evaluations on 8x NVIDIA B300 GPUs connected via NVLink demonstrate that NCCLbpf imposes just 80-130 ns overhead per tuner decision (less than 0.03% of collective latency), prevents all tested unsafe plugin behaviors at load-time, and enables a message-size-aware eBPF policy that improves AllReduce throughput by up to 27% over NCCL's default in the 4-128 MiB range.
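As a rough illustration of what "message-size-aware" means here, the following plain-C sketch picks a non-default configuration only for messages in the 4-128 MiB range named in the abstract. Everything else is an assumption: the abstract does not say which algorithm or protocol the policy selects, so the sketch returns an opaque `CHOICE_TUNED` placeholder, treats both range bounds as inclusive, and uses hypothetical names (`choose`, `enum choice`, `MIB`) rather than NCCL's tuner API.

```c
#include <stddef.h>

#define MIB ((size_t)1 << 20)

/* Opaque placeholders: what "tuned" selects is not stated in the abstract. */
enum choice { CHOICE_DEFAULT = 0, CHOICE_TUNED = 1 };

/* Branch-only decision with no loops and no memory access, the kind of
 * logic that is easy to check statically. */
static enum choice choose(size_t nbytes) {
    /* Range taken from the abstract; inclusive bounds are an assumption. */
    if (nbytes >= 4 * MIB && nbytes <= 128 * MIB) {
        return CHOICE_TUNED;
    }
    /* Outside the range, defer to the library's default behavior. */
    return CHOICE_DEFAULT;
}
```

For example, `choose(64 * MIB)` yields `CHOICE_TUNED`, while `choose(1 * MIB)` falls through to `CHOICE_DEFAULT`.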

Code Generation & Program Synthesis · Distributed Systems & Hardware

Citation Metrics

Citations: 0

Influential citations: 0

References: 26

Year: 2026

Venue: N/A
