This thesis addresses five limitations of automated anomaly detection and root cause analysis (RCA) for microservice systems. It introduces three methods: BARO, an end-to-end anomaly detection and RCA approach for metric data; EventADL, an end-to-end framework for event data; and TORAI, a multimodal RCA framework that requires no service call graph. Experiments on real microservice systems demonstrate their effectiveness and robustness. The thesis also provides RCAEval, a benchmark with ready-to-use datasets and reproducible baselines, and a systematic evaluation of existing RCA methods, particularly those based on causal inference, whose findings can guide future work on incident mitigation and remediation.
Automated anomaly detection and RCA for microservices can draw on both metric and event data, and diagnosis no longer has to depend on a given service call graph.
Microservice systems are widely used to build cloud applications, yet their complexity makes failures inevitable, degrading user experience and causing economic loss. Automated anomaly detection and root cause analysis (RCA) are now active research areas, but existing techniques share five limitations. First, most treat anomaly detection and RCA separately, assuming anomalies are detected correctly, and falter when detection is imprecise due to noise or delay. Second, they focus on metrics, logs, and traces, leaving event data such as API calls and configuration changes underexplored. Third, many require a given service call graph and cannot diagnose without one. Fourth, the field lacks standardised datasets and evaluation frameworks, so methods are hard to compare fairly. Fifth, although causal inference-based RCA has become dominant, its effectiveness, efficiency, and robustness remain unclear. This thesis addresses these limitations through two groups of contributions. The first introduces methods that exploit observability data both independently and collectively. BARO is an end-to-end anomaly detection and RCA approach for metric data. EventADL is an end-to-end framework for event data. TORAI is a multimodal RCA framework that requires no service call graph. Extensive experiments on real microservice systems demonstrate their effectiveness and robustness. The second group delivers benchmarking datasets, an evaluation framework, and systematic evaluation efforts. RCAEval is a comprehensive benchmark providing ready-to-use datasets and reproducible baselines for future research. A systematic evaluation of existing RCA methods, especially causal inference-based approaches, offers insights that guide future directions. This thesis thereby advances automated anomaly detection and RCA for microservice failures, enabling future research on incident mitigation and remediation.