This paper introduces LATS-RCA, a multi-agent framework leveraging LLMs for automated root cause analysis (RCA) in microservice-based systems. LATS-RCA employs a Language Agent Tree Search algorithm, where multiple LLM agents analyze logs and metrics to explore potential root causes in a tree-structured manner, guided by reflection scores. Experiments on the Light-OAuth2 system and a production environment demonstrate LATS-RCA's diagnostic accuracy and highlight the challenges of applying it to complex, real-world systems with diverse technologies and logging practices.
LLMs can now collaboratively pinpoint root causes in microservices using a tree-structured search, but production environments reveal the limitations of this approach when faced with polyglot stacks and inconsistent logging.
Recent advances in large language models (LLMs) have enabled early attempts to automate root cause analysis (RCA) in microservice-based systems (MSS). Yet prior work typically relies on a linear reasoning process that follows a single diagnostic path. In this paper, we propose LATS-RCA, an LLM-based multi-agent framework for RCA in MSS. LATS-RCA formulates RCA as a reflection-guided, tree-structured search using a Language Agent Tree Search algorithm. In LATS-RCA, multiple LLM-driven agents iteratively perform RCA for each microservice by reasoning over its execution logs and performance metrics to collect operational evidence for root cause exploration. Reflection scores derived from intermediate diagnostic states guide the search toward the most likely root cause based on accumulated evidence. We evaluate LATS-RCA on the open-source industrial MSS Light-OAuth2 (LO2), using a publicly available dataset, and in a production microservice environment (Prod) at a case company with substantially higher operational complexity. LO2 is a small-team Java system with a homogeneous technology stack. The results on LO2 show that LATS-RCA achieves high diagnostic accuracy, and we further benchmark its associated computational costs. On Prod, LATS-RCA attains lower diagnostic accuracy and incurs higher computational cost than on LO2. The Prod deployment demonstrates the practical applicability of LATS-RCA in real-world MSS and reflects the challenges introduced by a polyglot tech stack, varied logging practices across source components, and multi-factor root causes in production-scale MSS.
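To make the idea of a reflection-guided tree search concrete, here is a minimal Python sketch. It is not the authors' implementation. The service names, the call graph `CALLS`, the scores in `EVIDENCE`, and the `reflect` stub (which stands in for an LLM agent reflecting on one service's logs and metrics) are all hypothetical. The UCT selection rule is borrowed from Monte Carlo tree search as an assumption; the abstract does not say how LATS-RCA selects or expands nodes.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical toy data (not from the paper): which services each service
# calls, and a stand-in reflection score in [0, 1] per service.
CALLS = {"gateway": ["auth", "orders"], "auth": ["token-db"],
         "orders": [], "token-db": []}
EVIDENCE = {"gateway": 0.2, "auth": 0.5, "orders": 0.1, "token-db": 0.9}


def reflect(service: str) -> float:
    """Stub for an LLM agent scoring how likely `service` is the root cause."""
    return EVIDENCE[service]


@dataclass
class Node:
    hypothesis: str                      # candidate root-cause service
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0                   # sum of backpropagated scores

    def uct(self, c: float = 1.0) -> float:
        # Unvisited nodes are tried first; otherwise balance the mean
        # score (exploitation) against an exploration bonus.
        if self.visits == 0:
            return math.inf
        mean = self.value / self.visits
        bonus = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return mean + bonus


def search(entry_service: str, iterations: int = 20) -> str:
    """Return the hypothesis with the highest reflection score found."""
    root = Node(entry_service)
    best_score, best_hypothesis = -1.0, entry_service
    for _ in range(iterations):
        # Selection: follow the highest-UCT child down to a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.uct())
        # Expansion: a leaf that was already evaluated gets its callees
        # added as new candidate root causes.
        if node.visits > 0:
            for callee in CALLS[node.hypothesis]:
                node.children.append(Node(callee, parent=node))
            if node.children:
                node = node.children[0]
        # Evaluation: ask the (stubbed) agent for a reflection score.
        score = reflect(node.hypothesis)
        if score > best_score:
            best_score, best_hypothesis = score, node.hypothesis
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += score
            node = node.parent
    return best_hypothesis


if __name__ == "__main__":
    print(search("gateway"))
```

Unlike a linear process that commits to a single diagnostic path, the search keeps statistics for every branch, so a weak first impression of one service does not prevent the search from returning to it later. In a real system the `reflect` stub would be replaced by agent calls over actual logs and metrics, which is where the computational cost reported in the abstract would arise.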