This paper introduces SysOM-AI, a production observability system that continuously monitors AI training performance across CPU, GPU, and network layers using eBPF-based tracing and adaptive hybrid stack unwinding. By integrating CPU stack profiling, GPU kernel tracing, and NCCL event instrumentation, SysOM-AI keeps overhead below 0.4%. Deployed at Alibaba across over 80,000 GPUs, SysOM-AI reduced the median diagnosis time for production issues from days to approximately 10 minutes.
Pinpointing performance bottlenecks in large-scale AI training just got 100x faster, thanks to a new system that watches the whole stack without slowing things down.
Performance diagnosis in production-scale AI training is challenging because subtle OS-level issues can trigger cascading GPU delays and network slowdowns, degrading training efficiency across thousands of GPUs. Existing profiling tools are limited to single system layers, incur prohibitive overhead (10--30%), or lack continuous deployment capabilities, so analyses remain manual and span days. We argue that continuous, cross-layer observability, enabled by OS-level instrumentation and layered differential diagnosis, is necessary to address this gap. We introduce SysOM-AI, a production observability system that continuously monitors training by integrating CPU stack profiling, GPU kernel tracing, and NCCL event instrumentation via adaptive hybrid stack unwinding and eBPF-based tracing, incurring less than 0.4% overhead. Deployed at Alibaba across over 80,000 GPUs for more than one year, SysOM-AI helped diagnose 94 confirmed production issues, reducing median diagnosis time from days to approximately 10 minutes.
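To make the idea of "layered differential diagnosis" concrete, here is a minimal illustrative sketch, not the paper's implementation: per-layer latency profiles from a degraded run are compared against a healthy baseline, and the layer with the largest relative slowdown is flagged. The layer names, sample data, and the 1.2x threshold are all assumptions for illustration.

```python
# Illustrative sketch of layered differential diagnosis (hypothetical, not
# SysOM-AI's actual algorithm): flag the system layer whose median latency
# degraded most relative to a healthy baseline.
from statistics import median

def diagnose_slow_layer(baseline, degraded, threshold=1.2):
    """Return (layer, ratio) for the largest median slowdown above threshold.

    baseline/degraded: dict mapping layer name -> list of latency samples (ms).
    """
    worst_layer, worst_ratio = None, threshold
    for layer, samples in degraded.items():
        ratio = median(samples) / median(baseline[layer])
        if ratio > worst_ratio:
            worst_layer, worst_ratio = layer, ratio
    return worst_layer, worst_ratio

# Hypothetical per-layer latency samples (ms) for a healthy and a slow run.
baseline = {"cpu": [1.0, 1.1, 0.9], "gpu_kernel": [5.0, 5.2], "nccl": [2.0, 2.1]}
degraded = {"cpu": [1.0, 1.0, 1.1], "gpu_kernel": [5.1, 5.3], "nccl": [6.0, 6.4]}
print(diagnose_slow_layer(baseline, degraded))  # flags "nccl", ~3x slower
```

In a real deployment the samples would come from continuous eBPF-collected traces per layer rather than hard-coded lists; the differential comparison itself stays this simple.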