The paper introduces STELLAR, an autonomous tuning system for parallel file systems that leverages LLMs to optimize I/O performance. STELLAR uses an agentic approach incorporating RAG, external tool execution, and multi-agent design to extract tunable parameters, analyze I/O traces, select tuning strategies, and iteratively refine configurations based on performance feedback. Evaluations demonstrate that STELLAR achieves near-optimal configurations within a few attempts, significantly outperforming traditional autotuning methods that require extensive iterations.
LLMs can autonomously tune parallel file systems to near-optimal performance in just five attempts, unlocking I/O optimizations previously inaccessible to most domain scientists.
I/O performance is crucial to efficiency in data-intensive scientific computing, but tuning large-scale storage systems is complex, costly, and notoriously labor-intensive, making it inaccessible to most domain scientists. To address this problem, we propose STELLAR, an autonomous tuner for high-performance parallel file systems. Our evaluations show that STELLAR almost always selects near-optimal configurations for parallel file systems within the first five attempts, even for previously unseen applications. STELLAR’s human-like efficiency is fundamentally different from existing autotuning methods, which often require hundreds of thousands of iterations to converge. STELLAR achieves this through autonomous, end-to-end agentic tuning. Powered by large language models (LLMs), STELLAR is capable of (1) accurately extracting tunable parameters from software manuals, (2) analyzing I/O trace logs generated by applications, (3) selecting initial tuning strategies, (4) rerunning applications on real systems and collecting I/O performance feedback, (5) adjusting tuning strategies and repeating the tuning cycle, and (6) reflecting on and summarizing tuning experiences into reusable knowledge for future optimizations. STELLAR integrates retrieval-augmented generation (RAG), external tool execution, LLM-based reasoning, and a multi-agent design to stabilize reasoning and combat hallucinations. We evaluate how each of these components affects optimization outcomes, providing insight into the design of similar systems for other optimization problems. STELLAR’s architecture and empirical validation open new avenues for tackling complex system optimization challenges, especially those characterized by vast search spaces and high exploration costs. Its highly efficient autonomous tuning will broaden access to I/O performance optimization for domain scientists with minimal additional resource investment.
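The propose–run–refine cycle described above (steps 3–5) can be sketched as a simple feedback loop. The sketch below is purely illustrative, not STELLAR's implementation: the class name, the toy `stripe_count` parameter, and the naive doubling heuristic standing in for LLM-based reasoning are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field

@dataclass
class TuningAgent:
    # Hypothetical sketch of an agentic tuning loop; names are illustrative.
    params: dict                                   # tunable parameters (e.g. extracted from manuals)
    history: list = field(default_factory=list)    # (config, performance) feedback pairs

    def propose(self):
        # Stand-in for LLM reasoning over I/O traces and retrieved context:
        # start from the defaults, then refine the best config seen so far.
        if not self.history:
            return dict(self.params)
        best_cfg, _ = max(self.history, key=lambda h: h[1])
        return {k: v * 2 for k, v in best_cfg.items()}  # naive refinement rule

    def tune(self, run_app, attempts=5):
        # Iterate: propose a config, rerun the application, record feedback.
        for _ in range(attempts):
            cfg = self.propose()
            self.history.append((cfg, run_app(cfg)))
        return max(self.history, key=lambda h: h[1])[0]

# Toy benchmark standing in for a real application run:
# "bandwidth" peaks when stripe_count equals 8.
def run_app(cfg):
    return -abs(cfg["stripe_count"] - 8)

agent = TuningAgent(params={"stripe_count": 1})
best = agent.tune(run_app)      # → {"stripe_count": 8} on this toy benchmark
```

In a real system, `propose` would be an LLM call grounded in trace logs and RAG-retrieved documentation, and `run_app` would launch the application on the actual parallel file system; the loop structure, however, is the same.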