Mar 19, 2026 · arXiv:2603.18695

High-Performance Portable GPU Primitives for Arbitrary Types and Operators in Julia

AI Summary

KernelForge.jl, a Julia library, achieves vendor-level performance for GPU primitives (scan, mapreduce, matrix-vector) by using a two-layer portable architecture: KernelIntrinsics.jl for backend-agnostic abstractions and KernelForge.jl for high-performance algorithms built on these abstractions. Benchmarking on NVIDIA A40 and AMD MI300X GPUs shows KernelForge.jl matching or exceeding CUB kernel execution time for scan and mapreduce on the A40, and matching cuBLAS throughput for matrix-vector operations across most configurations. This demonstrates that portable JIT-compiled abstractions can achieve vendor-level throughput without sacrificing generality.

Key Contribution

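KernelForge.jl shows, as a proof of concept, that portable, JIT-compiled GPU primitives written in Julia can reach vendor-level performance without sacrificing generality: it matches or exceeds CUB on scan and mapreduce on the NVIDIA A40, and matches cuBLAS on matrix-vector operations across most tested configurations.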

Abstract

Portable GPU frameworks such as Kokkos and RAJA reduce the burden of cross-architecture development but typically incur measurable overhead on fundamental parallel primitives relative to vendor-optimized libraries. We present KernelForge.jl, a Julia library that implements scan, mapreduce, and matrix-vector primitives through a two-layer portable architecture: KernelIntrinsics.jl provides backend-agnostic abstractions for warp-level shuffles, memory fences, and vectorized memory access, while KernelForge.jl builds high-performance algorithms exclusively on top of these interfaces. Evaluated on an NVIDIA A40 and an AMD MI300X, KernelForge.jl matches or exceeds CUB kernel execution time on scan and mapreduce on the A40, and matches cuBLAS throughput on matrix-vector operations across most tested configurations, demonstrating, as a proof of concept, that portable JIT-compiled abstractions can achieve vendor-level throughput without sacrificing generality.
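To make "arbitrary types and operators" concrete, here is a minimal sequential sketch of what scan and mapreduce compute when the element type is user-defined and the operator is associative but not commutative. It is a CPU reference in Python, not the KernelForge.jl API and not a GPU implementation. The names `Affine`, `compose`, `inclusive_scan`, and `mapreduce` are illustrative assumptions that do not come from the paper.

```python
from functools import reduce
from itertools import accumulate
from typing import NamedTuple


class Affine(NamedTuple):
    """The map x -> a * x + b, stored as its two coefficients (hypothetical example type)."""
    a: int
    b: int


def compose(f: Affine, g: Affine) -> Affine:
    """Apply f first, then g: g(f(x)) = g.a * (f.a * x + f.b) + g.b.

    Composition is associative but not commutative, so a correct scan
    must preserve the left-to-right order of its inputs.
    """
    return Affine(g.a * f.a, g.a * f.b + g.b)


# Neutral element of `compose`: the identity map x -> x.
IDENTITY = Affine(1, 0)


def inclusive_scan(op, xs):
    """Sequential reference: out[i] = xs[0] op xs[1] op ... op xs[i]."""
    return list(accumulate(xs, op))


def mapreduce(f, op, xs, init):
    """Sequential reference: apply f to every element, then fold with op from init."""
    return reduce(op, map(f, xs), init)


maps = [Affine(2, 1), Affine(3, 0), Affine(1, 5)]

# Running compositions of the three maps, in order.
prefix = inclusive_scan(compose, maps)

# Full composition; equals the last element of the inclusive scan.
total = mapreduce(lambda m: m, compose, maps, IDENTITY)
```

A parallel implementation may regroup these operations freely because the operator is associative, but it may not reorder them. That is why a non-commutative operator such as `compose` is a useful correctness check for a generic scan.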

Code Generation & Program Synthesis · Distributed Systems & Hardware · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 28

Year: 2026

Venue: N/A
