This paper investigates the suitability of DAOS and Ceph object storage systems as alternatives to POSIX-based Lustre file systems for the I/O workloads of ECMWF's Numerical Weather Prediction (NWP) system. The authors developed software adapters to enable ECMWF's NWP to utilize these object stores and benchmarked their I/O performance against Lustre on identical hardware. Results indicate that both DAOS and Ceph offer excellent performance, with DAOS demonstrating superior scalability and flexibility compared to Ceph and Lustre for large-scale I/O.
DAOS shows superior scalability and flexibility to both Lustre and Ceph in NWP I/O benchmarks, suggesting object storage could see wider adoption for HPC data handling.
Driven by scientific and industry ambition, HPC and AI applications such as operational Numerical Weather Prediction (NWP) require processing and storing ever-increasing data volumes as fast as possible. Whilst POSIX distributed file systems and NVMe SSDs are currently a common HPC storage configuration for application I/O, new storage solutions have proliferated or gained traction over the last decade, with the potential to address the performance limitations that POSIX file systems exhibit at scale for certain I/O workloads. The primary aim of this work has been to assess the suitability and performance of two object storage systems, namely DAOS and Ceph, for ECMWF's operational NWP as well as for HPC and AI applications in general. New software-level adapters have been developed that enable ECMWF's NWP to leverage these systems, and extensive I/O benchmarking has been conducted on a few computer systems, comparing the performance delivered by the evaluated object stores with that of equivalent Lustre file system deployments on the same hardware. The challenges of porting to object storage and its benefits relative to the traditional POSIX I/O approach are discussed and, where possible, domain-agnostic performance analysis has been conducted, yielding insights that are also relevant to I/O practitioners and the broader HPC community. DAOS and Ceph both demonstrated excellent performance, but DAOS stood out relative to Ceph and Lustre, providing superior scalability and flexibility for applications to perform I/O at scale as desired. This sets a promising outlook for DAOS and object storage, which may see greater adoption at HPC centres in the years to come, although this does not necessarily imply a shift away from POSIX-like I/O.
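As a rough illustration of what a software-level storage adapter of the kind described above does, the Python sketch below hides the backend behind a small archive/retrieve interface, so the same application code can target either a POSIX directory tree or an object store. This is an assumption-laden sketch, not the authors' software: all names (`FieldStore`, `PosixFieldStore`, `InMemoryObjectStore`, `canonical_key`) are hypothetical, and the object-store backend is an in-memory dictionary standing in for a real DAOS or Ceph client, whose APIs are deliberately not shown.

```python
import os
import tempfile
from abc import ABC, abstractmethod


def canonical_key(key: dict) -> str:
    # Sort metadata entries so the same field always maps to the same name.
    return ",".join(f"{k}={key[k]}" for k in sorted(key))


class FieldStore(ABC):
    """Hypothetical adapter interface: the application only sees archive/retrieve."""

    @abstractmethod
    def archive(self, key: dict, data: bytes) -> None: ...

    @abstractmethod
    def retrieve(self, key: dict) -> bytes: ...


class PosixFieldStore(FieldStore):
    """POSIX-style backend: one file per field under a root directory."""

    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key: dict) -> str:
        # Swap separators for file-name-friendly characters (sketch only:
        # values containing '/' are not handled).
        name = canonical_key(key).replace("=", "-").replace(",", "_")
        return os.path.join(self.root, name)

    def archive(self, key: dict, data: bytes) -> None:
        with open(self._path(key), "wb") as f:
            f.write(data)

    def retrieve(self, key: dict) -> bytes:
        with open(self._path(key), "rb") as f:
            return f.read()


class InMemoryObjectStore(FieldStore):
    """Stand-in for an object store: a flat key -> object namespace.
    A real DAOS or Ceph backend would call its client library here instead."""

    def __init__(self):
        self._objects = {}

    def archive(self, key: dict, data: bytes) -> None:
        self._objects[canonical_key(key)] = bytes(data)

    def retrieve(self, key: dict) -> bytes:
        return self._objects[canonical_key(key)]


if __name__ == "__main__":
    field = {"param": "t", "level": 850, "step": 6}
    with tempfile.TemporaryDirectory() as tmp:
        # Identical application code runs against both backends.
        for store in (PosixFieldStore(tmp), InMemoryObjectStore()):
            store.archive(field, b"\x00\x01")
            assert store.retrieve(field) == b"\x00\x01"
```

The design choice the sketch is meant to convey is that the application addresses data by a metadata key rather than by a file path; it is this indirection that lets a backend map fields onto objects instead of files without changes to the calling code.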