The George Washington University | Mar 29, 2026 | arXiv:2603.27863

Operational Strategies for Non-Disruptive Scheduling Transitions in Production HPC Systems

Glen MacLachlan, Joseph Creech, Rubeel Muhammad Iqbal, Clark Gaylord, Jake Messick

AI Summary

This paper presents a strategy for transitioning a production HPC cluster from node-exclusive scheduling to consumable resource scheduling with trackable resources (TRES) mid-lifecycle. The approach combines a time-bounded compatibility layer for legacy jobs, observability-driven feedback on resource usage, and targeted user engagement to encourage adoption of explicit resource declaration. The non-disruptive transition reduced median queue wait times by roughly 99% for CPU workloads (277 minutes to under 3 minutes) and by about 96% for GPU workloads (81 minutes to 3.4 minutes), and users who adopted TRES-based submissions showed strong long-term retention.

Key Contribution

Changing an HPC cluster's scheduling model mid-lifecycle does not have to disrupt active workloads: a carefully staged transition can sharply reduce queue wait times while bringing users along to explicit resource declaration.

Abstract

Migrating heterogeneous high-performance computing (HPC) systems to resource-aware scheduling introduces both technical and behavioral challenges, particularly in production environments with established user workflows. This paper presents a case study of transitioning a production academic HPC cluster from node-exclusive to consumable resource scheduling mid-lifecycle, without disrupting active workloads. We describe an operational strategy combining a time-bounded compatibility layer, observability-driven feedback, and targeted user engagement to guide adoption of explicit resource declaration. This approach protected active research workflows throughout the transition, avoiding the disruption that a direct cut-over would have imposed on the user community. Following deployment, median queue wait times fell from 277 minutes to under 3 minutes for CPU workloads and from 81 minutes to 3.4 minutes for GPU workloads. Users who adopted TRES-based submission exhibited strong long-term retention. These results demonstrate that successful scheduling transitions depend not only on system configuration, but on aligning observability, user engagement, and operational design.
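As a quick check on the scale of the improvement, the relative reductions implied by the reported medians can be worked out directly. The sketch below uses only the figures quoted in the abstract. The helper name `percent_reduction` is illustrative, and because the CPU figure is given as "under 3 minutes", 3.0 is treated as an upper bound, so the CPU result is a lower bound on the true reduction.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Median queue wait times in minutes, as reported in the abstract.
CPU_BEFORE = 277.0
CPU_AFTER_UPPER_BOUND = 3.0  # assumption: "under 3 minutes" taken as at most 3.0
GPU_BEFORE = 81.0
GPU_AFTER = 3.4

cpu_reduction = percent_reduction(CPU_BEFORE, CPU_AFTER_UPPER_BOUND)
gpu_reduction = percent_reduction(GPU_BEFORE, GPU_AFTER)

print(f"CPU: at least {cpu_reduction:.1f}% lower median wait")
print(f"GPU: about {gpu_reduction:.1f}% lower median wait")
```

Under these assumptions the CPU median falls by at least 98.9% and the GPU median by about 95.8%, so the two workload classes improved by similar but not identical margins.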

Distributed Systems & Hardware

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
