UT Austin · Mar 2, 2026 · arXiv:2603.02376

CUCo: An Agentic Framework for Compute and Communication Co-design

Bodun Hu, Yoga Sri Varshan, Saurabh Agarwal, Aditya Akella

AI Summary

The paper introduces CUCo, a training-free agentic framework designed to automatically generate optimized CUDA kernels for large-scale distributed LLM training and inference. CUCo addresses the gap in existing kernel optimization techniques by jointly optimizing computation and communication, which are typically treated separately. The framework achieves up to 1.57x latency reduction compared to state-of-the-art baselines by unlocking novel co-optimization opportunities.

Key Contribution

Stop hand-writing CUDA kernels: CUCo's agent-driven approach co-optimizes computation and communication, slashing LLM training/inference latency by up to 1.57x.

Abstract

Custom CUDA kernel development is essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet manually writing kernels that jointly leverage both computation and communication remains a labor-intensive and error-prone process. Prior work on kernel optimization has focused almost exclusively on computation, leaving communication kernels largely untouched even though they constitute a significant share of total execution time. We introduce CUCo, a training-free agent-driven workflow that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication. By co-optimizing these traditionally disjoint components, CUCo unlocks new optimization opportunities unavailable to existing approaches, outperforming state-of-the-art baselines and reducing end-to-end latency by up to $1.57\times$.

Code Generation & Program Synthesis · Distributed Systems & Hardware · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
