This paper introduces CuTe, a novel mathematical specification for representing and manipulating tensor layouts, addressing the increasing complexity of data layouts required by modern hardware like tensor cores. CuTe's hierarchical layout representation extends traditional flat-shape and flat-stride approaches, while its layout algebra supports operations like concatenation, tiling, and inversion for layout manipulation and static analysis. The authors demonstrate CuTe's utility in software development, compile-time verification, and expression of generic tensor transformations, highlighting its deployment in NVIDIA's CUTLASS library.
Representing tensor layouts with a hierarchical algebra unlocks powerful compile-time reasoning and simplifies the expression of tiling/partitioning patterns for specialized hardware.
Modern architectures for high-performance computing and deep learning increasingly incorporate specialized tensor instructions, including tensor cores for matrix multiplication and hardware-optimized copy operations for multi-dimensional data. These instructions prescribe fixed, often complex data layouts that must be correctly propagated through the entire execution pipeline to ensure both correctness and optimal performance. We present CuTe, a novel mathematical specification for representing and manipulating tensors. CuTe introduces two key innovations: (1) a hierarchical layout representation that directly extends traditional flat-shape and flat-stride tensor representations, enabling the representation of complex mappings required by modern hardware instructions, and (2) a rich algebra of layout operations -- including concatenation, coalescence, composition, complementation, division, tiling, and inversion -- that enables sophisticated layout manipulation, derivation, verification, and static analysis. CuTe layouts provide a framework for managing both data layouts and thread arrangements in GPU kernels, while the layout algebra enables powerful compile-time reasoning about layout properties and the expression of generic tensor transformations. In this work, we demonstrate that CuTe's abstractions significantly aid software development compared to traditional approaches, promote compile-time verification of architecturally prescribed layouts, facilitate the implementation of algorithmic primitives that generalize to a wide range of applications, and enable the concise expression of tiling and partitioning patterns required by modern specialized tensor instructions. CuTe has been successfully deployed in production systems, forming the foundation of NVIDIA's CUTLASS library and a number of related efforts including CuTe DSL.
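To make the two ideas in the abstract concrete, the following is a minimal Python sketch of a hierarchical layout, assuming a layout is a pair of congruent nested tuples (shape, stride) and that an integer coordinate is unpacked with the leftmost mode varying fastest. The helper names (`flatten`, `size`, `crd2idx`, `coalesce`) are illustrative choices for this sketch and are not the CUTLASS API. The `coalesce` shown here is a simplified version that works on the flattened layout only.

```python
from math import prod


def flatten(t):
    """Flatten a nested tuple of ints into a flat tuple of ints."""
    if isinstance(t, int):
        return (t,)
    return tuple(x for sub in t for x in flatten(sub))


def size(shape):
    """Number of coordinates in a (possibly nested) shape."""
    return prod(flatten(shape))


def crd2idx(coord, shape, stride):
    """Map an in-bounds coordinate to a linear offset.

    `coord` may be an int (1-D coordinate), or a tuple that mirrors
    `shape` down to any depth.
    """
    if isinstance(shape, int):
        # Leaf mode: plain coordinate-times-stride.
        return coord * stride
    if isinstance(coord, int):
        # Integer coordinate into a tuple shape: peel off one mode at a
        # time, leftmost mode fastest.
        offset = 0
        for s, d in zip(shape, stride):
            n = size(s)
            offset += crd2idx(coord % n, s, d)
            coord //= n
        return offset
    # Tuple coordinate: recurse mode by mode and sum the contributions.
    return sum(crd2idx(c, s, d) for c, s, d in zip(coord, shape, stride))


def coalesce(shape, stride):
    """Simplified coalesce: merge adjacent flattened modes when the next
    stride equals the previous shape times the previous stride, and drop
    size-1 modes. The 1-D coordinate-to-offset mapping is unchanged."""
    out_shape, out_stride = [], []
    for s, d in zip(flatten(shape), flatten(stride)):
        if s == 1:
            continue
        if out_shape and d == out_shape[-1] * out_stride[-1]:
            out_shape[-1] *= s
        else:
            out_shape.append(s)
            out_stride.append(d)
    if not out_shape:
        return (1,), (0,)
    return tuple(out_shape), tuple(out_stride)


# A 4x4 layout whose row mode is itself split into (2, 2) with strides
# (1, 8). A single flat (shape, stride) pair per dimension cannot
# express this row mapping.
shape, stride = ((2, 2), 4), ((1, 8), 2)
offsets = [crd2idx(i, shape, stride) for i in range(size(shape))]
print(offsets)

# The same element addressed with 1-D, 2-D, and fully nested coordinates.
print(crd2idx(7, shape, stride),
      crd2idx((3, 1), shape, stride),
      crd2idx(((1, 1), 1), shape, stride))

# A nested but contiguous layout collapses to a single mode.
print(coalesce((2, (2, 2)), (1, (2, 4))))
```

In this sketch the three coordinate forms resolve to the same offset, which is the property that lets one layout serve both as a flat index map and as a structured tile description. The coalesce step illustrates the kind of simplification the layout algebra performs; the operations listed in the abstract (composition, complementation, division, and so on) are not attempted here.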