Tsinghua AIR, UC
Mar 10, 2026
arXiv:2603.09229

Flash-KMeans: Fast and Memory-Efficient Exact K-Means

Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Xiaoze Fan, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Kurt Keutzer, Song Han, Chenfeng Xu, Ion Stoica

AI Summary

Flash-KMeans optimizes the k-means algorithm for modern GPUs by addressing two bottlenecks: IO traffic in the assignment stage and atomic write contention in the centroid update stage. It introduces FlashAssign, which fuses distance computation with an online argmin to avoid materializing the N x K distance matrix, and sort-inverse update, which transforms atomic scatters into localized segment-level reductions. Evaluations on NVIDIA H200 GPUs show up to 17.9x end-to-end speedup over the best baselines, and speedups of 33x over cuML and over 200x over FAISS.
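The online-argmin idea behind the assignment stage can be illustrated on the CPU. The following is a minimal NumPy sketch, not the paper's GPU kernel: the function name `assign_online_argmin` and the `tile` parameter are hypothetical, and the sketch only mimics the fusion by processing centroids in tiles and keeping a running best distance and index per point, so that at most an N x tile block of distances exists at any time.

```python
import numpy as np

def assign_online_argmin(x, centroids, tile=256):
    """Nearest-centroid assignment without building the full N x K matrix.

    Illustrative CPU analogy only (hypothetical helper, not the paper's kernel).
    x: (N, d) points, centroids: (K, d). Returns (N,) int64 cluster ids.
    """
    n = x.shape[0]
    k = centroids.shape[0]
    best_dist = np.full(n, np.inf)           # running minimum per point
    best_idx = np.zeros(n, dtype=np.int64)   # running argmin per point
    x_sq = np.einsum("ij,ij->i", x, x)       # ||x||^2, computed once

    for start in range(0, k, tile):
        c = centroids[start:start + tile]
        c_sq = np.einsum("ij,ij->i", c, c)
        # Squared distances to this tile of centroids only: shape (N, <=tile).
        d = x_sq[:, None] - 2.0 * (x @ c.T) + c_sq[None, :]
        local_idx = np.argmin(d, axis=1)
        local_dist = d[np.arange(n), local_idx]
        # Strict '<' keeps the earliest centroid on ties, like np.argmin.
        better = local_dist < best_dist
        best_dist[better] = local_dist[better]
        best_idx[better] = local_idx[better] + start
    return best_idx
```

The result matches a full-matrix argmin; only the peak size of the intermediate distance buffer changes. On a GPU the per-tile block would presumably live in on-chip memory rather than HBM, which is the point of the fusion described above.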

Key Contribution

K-means, previously relegated to offline processing, runs up to 17.9x faster end to end on modern GPUs because Flash-KMeans removes the distance-matrix IO bottleneck in assignment and the atomic write contention in the centroid update.

Abstract

$k$-means has historically been positioned primarily as an offline processing primitive, typically used for dataset organization or embedding preprocessing rather than as a first-class component in online systems. In this work, we revisit this classical algorithm under the lens of modern AI system design and enable $k$-means as an online primitive. We point out that existing GPU implementations of $k$-means remain fundamentally bottlenecked by low-level system constraints rather than theoretical algorithmic complexity. Specifically, the assignment stage suffers from a severe IO bottleneck due to the massive explicit materialization of the $N \times K$ distance matrix in High Bandwidth Memory (HBM). Simultaneously, the centroid update stage is heavily penalized by hardware-level atomic write contention caused by irregular, scatter-style token aggregations. To bridge this performance gap, we propose flash-kmeans, an IO-aware and contention-free $k$-means implementation for modern GPU workloads. Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization; (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions. Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability. Extensive evaluations on NVIDIA H200 GPUs demonstrate that flash-kmeans achieves up to 17.9$\times$ end-to-end speedup over the best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33$\times$ and over 200$\times$, respectively.
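The sort-inverse update in (2) can be sketched in NumPy as well. This is a hedged CPU illustration under stated assumptions, not the authors' implementation: the name `update_sort_inverse` is hypothetical, a stable `argsort` of the assignments stands in for the inverse mapping, and `np.add.reduceat` stands in for the segment-level reduction. Leaving empty clusters at zero is also an assumption made for the sketch; the abstract does not say how they are handled.

```python
import numpy as np

def update_sort_inverse(x, assign, k):
    """Centroid update via sort + contiguous segment reduction.

    Illustrative sketch (hypothetical helper). x: (N, d) float points,
    assign: (N,) int cluster ids in [0, k). Returns (centroids, counts).
    """
    d = x.shape[1]
    # Inverse mapping: positions in sorted order -> original point ids.
    order = np.argsort(assign, kind="stable")
    counts = np.bincount(assign, minlength=k)

    # Each non-empty cluster now owns one contiguous segment of x[order].
    nonempty = np.flatnonzero(counts)
    starts = np.cumsum(counts)[nonempty] - counts[nonempty]

    sums = np.zeros((k, d), dtype=x.dtype)
    # One reduction per segment instead of one scattered add per point.
    sums[nonempty] = np.add.reduceat(x[order], starts, axis=0)

    centroids = np.zeros_like(sums)  # assumption: empty clusters stay at zero
    centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
    return centroids, counts
```

A scatter-style update would instead add each point into its centroid's accumulator (for example with `np.add.at`), which on a GPU corresponds to many threads issuing atomic writes to the same rows. Sorting first makes each cluster's points contiguous, so each segment can be reduced locally.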

Distributed Systems & Hardware, Inference & Quantization, Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 30

Year: 2026

Venue: N/A
