The paper introduces TiledAttention, a PyTorch-callable scaled dot-product attention (SDPA) forward kernel implemented in cuTile Python (TileIR) for research on NVIDIA GPUs. Schedule-level choices such as tile shapes, staging, and shared-memory layout can be changed directly from Python, which supports rapid, reproducible kernel research without extensive CUDA/CUTLASS rewrites. Benchmarks on an NVIDIA DGX GB10 node show large speedups over standard eager attention paths, although production fused baselines remain faster overall; TiledAttention thus offers a practical balance between performance and customizability.
Ditch the CUDA boilerplate: TiledAttention lets you rapidly prototype and tweak custom attention kernels directly from Python, unlocking faster iteration on novel SDPA architectures.
TiledAttention is a scaled dot-product attention (SDPA) forward operator for attention-kernel research on NVIDIA GPUs. Implemented in cuTile Python (TileIR) and exposed as a PyTorch-callable function, it is easier to modify than low-level CUDA templates while retaining realistic kernel behavior via online softmax and tiled $K,V$ streaming. The approach is both performant and directly editable at the schedule level from Python (tile shapes, staging, shared-memory layout), enabling rapid, reproducible kernel research without template-heavy CUDA/CUTLASS rewrites. We benchmark TiledAttention on an NVIDIA DGX GB10 node with a reproducible harness and compare against PyTorch SDPA (auto-dispatch) and explicit unfused baselines across sequence length, head dimension, and precision (FP16/BF16). While production fused baselines remain stronger overall, TiledAttention delivers large speedups over standard eager attention paths and can be used directly within PyTorch workflows, providing a practical balance between performance and customizability.
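To make "online softmax and tiled $K,V$ streaming" concrete, the following is a minimal NumPy sketch of the general technique, not the paper's cuTile kernel. The function names (`reference_attention`, `tiled_attention`) and the `tile_n` parameter are hypothetical, chosen here for illustration; it handles a single head with no masking and runs on the CPU. It compares an unfused reference, which materializes the full score matrix, against a loop that visits $K$ and $V$ one tile at a time while maintaining a running row maximum and row sum.

```python
import numpy as np


def reference_attention(q, k, v):
    """Unfused baseline: build the full (n_q, n_k) score matrix, then softmax."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.T) * scale
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)
    return probs @ v


def tiled_attention(q, k, v, tile_n=32):
    """Stream K/V in tiles of `tile_n` rows using an online softmax.

    Illustrative only: `tile_n` stands in for the kind of schedule-level
    knob (tile shape) that the abstract describes; it is an assumption here.
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    running_max = np.full((n_q, 1), -np.inf)  # row-wise max seen so far
    running_sum = np.zeros((n_q, 1))          # row-wise softmax denominator
    acc = np.zeros((n_q, v.shape[-1]))        # unnormalized output accumulator

    for start in range(0, k.shape[0], tile_n):
        k_tile = k[start:start + tile_n]
        v_tile = v[start:start + tile_n]
        scores = (q @ k_tile.T) * scale       # only an (n_q, tile_n) block

        new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
        # Rescale earlier partial results to the new maximum.
        # On the first tile this is exp(-inf) = 0, which discards the zeros.
        correction = np.exp(running_max - new_max)
        p = np.exp(scores - new_max)

        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        acc = acc * correction + p @ v_tile
        running_max = new_max

    return acc / running_sum
```

Because each earlier partial result is rescaled whenever the running maximum changes, the tiled loop matches the reference up to floating-point rounding for any tile size, including sizes that do not divide the sequence length. That property is what allows a tile shape to be varied as a schedule choice without changing the result.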