TORAI, a novel unsupervised root cause analysis (RCA) approach, addresses the challenge of blind spots (services without traces) in microservice systems. It leverages multi-source telemetry data to measure anomaly severity, clusters services based on severity symptoms, and uses causal analysis to rank services within each cluster. Experiments on benchmark systems demonstrate that TORAI outperforms state-of-the-art baselines in the presence of blind spots and accurately pinpoints root causes in top-3 recommendations for real-world failures.
Unsupervised root cause analysis is now possible even when you can't see all the services in your call graph.
Existing multi-source root cause analysis (RCA) methods for microservice systems assume that every service emits traces from which a service call graph can be constructed. This assumption is impractical: microservice systems evolve rapidly and may contain black-box services without traces, such as compiled software or unsupported services. We refer to these services as blind spots. When blind spots are present, existing multi-source RCA methods can degrade because they diagnose only the services visible on the call graph. To overcome this limitation, we propose TORAI, a novel unsupervised approach that pinpoints fine-grained root causes without relying on the service call graph. TORAI first measures anomaly severity using the available multi-source telemetry data. It then clusters services by their severity symptoms and conducts causal analysis to rank services within each severity cluster. Finally, TORAI aggregates the cluster rankings and uses hypothesis testing to identify fine-grained root causes. Because it needs neither a constructed service call graph nor further intrusive actions, TORAI addresses the limitations of existing methods. Our experiments on three benchmark systems show that TORAI substantially outperforms state-of-the-art baselines in the presence of blind spots. Results on real-world failures further show that TORAI can accurately pinpoint root causes within its top-3 recommendations.
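The first three stages described above (severity measurement, severity clustering, per-cluster ranking and aggregation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not TORAI's implementation: the z-score severity, the gap-based clustering, the `gap` threshold, every function name, and the sample telemetry are hypothetical. Within each cluster the sketch simply keeps severity order where the abstract describes causal analysis, and the final hypothesis-testing step is omitted.

```python
from statistics import mean, pstdev

def severity(baseline, observed):
    """Largest absolute z-score of the failure window against the baseline (assumed measure)."""
    mu = mean(baseline)
    sigma = pstdev(baseline) or 1e-9  # guard against a constant baseline
    return max(abs(x - mu) / sigma for x in observed)

def service_severity(metrics):
    """Assumption: a service is as anomalous as its most anomalous metric."""
    return max(severity(base, obs) for base, obs in metrics.values())

def cluster_by_severity(scores, gap=3.0):
    """Split services, sorted by descending severity, wherever adjacent scores differ by >= gap."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    clusters = [[ordered[0]]]
    for item in ordered[1:]:
        if clusters[-1][-1][1] - item[1] < gap:
            clusters[-1].append(item)
        else:
            clusters.append([item])
    return clusters

def rank_services(telemetry, gap=3.0, rank_within=None):
    """Aggregate per-cluster rankings, most severe cluster first. No call graph is used."""
    scores = {svc: service_severity(m) for svc, m in telemetry.items()}
    ranking = []
    for cluster in cluster_by_severity(scores, gap):
        # Stand-in for causal analysis: keep severity order unless a
        # caller-supplied rank_within(cluster) reorders the members.
        ordered = rank_within(cluster) if rank_within else cluster
        ranking.extend(svc for svc, _ in ordered)
    return ranking

# Hypothetical telemetry: {service: {metric: (baseline_window, failure_window)}}.
# "db" plays the role of a blind spot: it has metrics but no traces.
telemetry = {
    "cart": {"latency_ms": ([10, 11, 9, 10], [10, 12, 11])},
    "db": {"cpu_pct": ([50, 52, 48, 50], [90, 95, 92])},
    "auth": {"latency_ms": ([20, 21, 19, 20], [20, 21, 20])},
}
ranking = rank_services(telemetry)
print(ranking)
```

Because the ranking is driven only by per-service telemetry, a service without traces can still appear at the top of the list, which is the property the abstract emphasizes.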