May 5, 2026

arXiv:2605.03566

Lifting to tensors when compiling scientific computing workloads for AI Engines

AI Summary

This paper introduces a compilation pipeline that maps OpenMP-decorated loops in scientific codes to AMD's AI Engines (AIEs) by lifting loop semantics into tensors. This tensor representation enables efficient mapping to AIEs, reducing the need for extensive code modification. Experiments on six kernel benchmarks show that the AIE performs comparably to a multicore CPU for float32 operations with reduced energy consumption, and CPU+AIE execution yields up to 40% performance improvement and 15% energy reduction for scientific kernels.

Key Contribution

Offloading OpenMP-annotated loops to AMD's AI Engines with minimal code changes: running two scientific computing kernels across the CPU and AI Engines together gives up to a 40% performance improvement and a 15% energy reduction compared to the CPU alone.

Abstract

It has been demonstrated that specialised architectures, such as FPGAs and AMD's AI Engines (AIEs), have the potential to deliver energy and performance advantages for scientific computing. Given the integration of AIEs into AMD's CPUs, this is an interesting avenue, especially when executing on the edge or making better use of compute-constrained local resources. However, a major challenge is enabling existing codes to run on this architecture without extensive modification. Put simply, porting codes to the AIE's execution model requires significant expertise and time. In this paper we explore a compilation pipeline for efficiently mapping loops in general-purpose scientific codes to AIEs. By lifting the semantics of an application into tensors, we demonstrate that this representation captures the intention of general-purpose loops annotated with OpenMP, and that such high-level tensor information provides a richness that is effective when mapping to the AIEs. Requiring only an OpenMP-decorated loop, our approach significantly reduces code complexity when targeting the architecture. For six kernel benchmarks, representing AI and scientific computing, the NPU under our approach performs comparably to the multicore CPU for float32, in all cases at reduced energy to solution. For two scientific computing kernels, running across both the CPU and NPU together delivers up to a 40% improvement in performance and a 15% reduction in energy usage compared to the CPU alone.
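To make the idea of lifting a loop into tensors concrete, the following is a conceptual sketch only. It is not the paper's pipeline, which operates on compiled codes annotated with OpenMP rather than on Python. It assumes a simple a*x + y kernel as the example loop, and every function name here is hypothetical. The loop form computes one element per iteration with no dependence between iterations, as a loop marked with `#pragma omp parallel for` would; the lifted form expresses the same computation as a composition of whole-tensor operations, which is the kind of higher-level information the abstract describes as useful when mapping to the AIEs.

```python
# Conceptual sketch: loop form versus a "lifted" whole-tensor form.
# All names are hypothetical and chosen for illustration only.

def saxpy_loop(a, x, y):
    """Loop form: one element per iteration, iterations are independent."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out


def tensor_scale(a, t):
    """Hypothetical whole-tensor op: multiply every element by a scalar."""
    return [a * v for v in t]


def tensor_add(s, t):
    """Hypothetical whole-tensor op: elementwise sum of equal-length tensors."""
    if len(s) != len(t):
        raise ValueError("shape mismatch")
    return [u + v for u, v in zip(s, t)]


def saxpy_lifted(a, x, y):
    """Lifted form: the same computation as a composition of tensor ops."""
    return tensor_add(tensor_scale(a, x), y)


x = [1.0, 2.0, 3.0, 4.0]
y = [0.5, 0.5, 0.5, 0.5]
# Both forms describe the same computation and agree element for element.
assert saxpy_loop(2.0, x, y) == saxpy_lifted(2.0, x, y)
```

In the lifted form the index variable and loop bounds disappear, leaving only the operations and the shapes of their operands, so a compiler is free to decide how to tile and distribute the work.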

Architecture Design (Transformers, SSMs, MoE)

Distributed Systems & Hardware

Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 17

Year: 2026

Venue: N/A
