This paper introduces a visual analytics system for understanding HPC system behavior. It integrates dimensionality reduction, contrastive learning, and dynamic mode decomposition to analyze unlabeled, high-dimensional system monitoring data. The system identifies meaningful node clusters and reveals behavioral differences within and across these groups by visualizing metrics such as CPU utilization and memory activity. Two case studies, supported by expert feedback, demonstrate the tool's effectiveness in detecting and interpreting anomalous behaviors.
HPC admins can now visually explore complex system behavior and detect subtle anomalies using a new tool that automatically clusters nodes and highlights performance variations.
In high-performance computing (HPC) environments, system monitoring data is often unlabeled and high-dimensional, making it difficult to reliably detect and understand anomalous computing nodes. The growing scale and dimensionality of the collected datasets present significant challenges for analysis and visualization tasks. We present a scalable, interactive visual analytics system to support exploration, explanation, and comparison of compute node behaviors in HPC systems. Our approach integrates an analysis workflow combining two-phase dimensionality reduction with contrastive learning and multi-resolution dynamic mode decomposition to capture inter- and intra-cluster variations. These analyses are embedded in an interactive interface that enables users to explore clusters, compare temporal patterns, and iteratively refine hypotheses through customizable visual encodings and baselines. By integrating metrics such as CPU utilization and memory activity, the system offers a holistic view of large-scale system behavior. We demonstrate the utility of our tool through two case studies. In both cases, our system automatically identified meaningful node clusters and revealed subtle behavioral differences within and across node groups. Expert feedback confirmed the effectiveness of our tool in enhancing anomalous behavior detection and interpretation. Our work advances scalable visual analysis for HPC monitoring and has broader implications for cloud, edge computing, and distributed infrastructures where interpretability and behavior analysis are critical to operational efficiency.
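To give a feel for the dynamic mode decomposition step named in the abstract, here is a minimal sketch of plain (single-resolution) exact DMD in NumPy. It is not the authors' implementation and does not cover the multi-resolution variant or the contrastive dimensionality reduction. The function name `dmd`, the `rank` parameter, and the synthetic two-metric signal are all assumptions made for illustration.

```python
import numpy as np

def dmd(snapshots, rank):
    """Plain exact DMD of a (n_metrics, n_timesteps) snapshot matrix.

    Hypothetical helper for illustration; not the paper's code.
    """
    X1, X2 = snapshots[:, :-1], snapshots[:, 1:]   # time-shifted snapshot pairs
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]    # keep the leading `rank` components
    V = Vh.conj().T
    # Reduced linear operator A_tilde = U^H X2 V S^{-1}
    # (dividing by s scales column j by 1 / s[j]).
    A_tilde = (U.conj().T @ X2 @ V) / s
    eigvals, W = np.linalg.eig(A_tilde)
    modes = ((X2 @ V) / s) @ W                     # exact DMD modes
    return eigvals, modes

# Synthetic stand-in for two monitored metrics on one node (assumed data):
# a decaying oscillation generated by a known linear map A.
theta, decay = 0.3, 0.95
A = decay * np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
x = np.array([1.0, 0.0])
cols = [x]
for _ in range(49):
    x = A @ x
    cols.append(x)
snapshots = np.column_stack(cols)                  # shape (2, 50)

eigvals, modes = dmd(snapshots, rank=2)
growth = np.abs(eigvals)            # per-step growth (>1) or decay (<1) of each mode
freq = np.abs(np.angle(eigvals))    # per-step oscillation frequency in radians
```

Each DMD eigenvalue summarizes one temporal pattern: its magnitude indicates growth or decay per time step and its angle indicates oscillation frequency. On this synthetic signal the recovered eigenvalues match those of the generating map, which is the property that makes DMD useful for comparing temporal behavior across nodes.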