This paper investigates how heterogeneity in microservice systems affects root cause localization (RCL). The authors find that entity-level heterogeneity, particularly asymmetric cross-layer interactions between services and hosts, significantly influences fault propagation. To address this, they propose NexusRCL, a semi-supervised framework that models services and hosts as distinct node types in a heterogeneous graph, achieving substantial improvements in RCL accuracy over existing methods.
Accounting for the asymmetric interplay between services and hosts in microservice architectures improves Top-1 root cause localization accuracy by up to 49.85% over state-of-the-art methods.
Microservice root cause localization is fundamentally challenged by the inherent heterogeneity of cloud-native systems, which encompasses diverse observability data and multiple system entities. Existing approaches typically focus on only one aspect of heterogeneity and thus fail to capture its full diagnostic value. In this work, we systematically examine the multifaceted role of heterogeneity within both microservice systems and the RCL process. This analysis motivates a deeper investigation into how entity-level distinctions and their asymmetric dependencies influence fault behavior. Our empirical analysis of two microservice benchmarks reveals that entity-level heterogeneity naturally gives rise to heterogeneous fault propagation, which is highly asymmetric and dominated by cross-layer interactions between services and hosts. In light of this, we propose NexusRCL, a semi-supervised framework that internalizes these propagation patterns by formalizing services and hosts as distinct node types within a heterogeneous graph. This design, coupled with an event-based abstraction mechanism, allows NexusRCL to effectively capture both data-level and entity-level heterogeneity while minimizing labeling costs through active learning. Comprehensive evaluations on two industrial benchmark datasets demonstrate NexusRCL's superior performance, achieving improvements of up to 49.85% in Top-1 accuracy (A@1) and 32.70% in Average Top-5 accuracy (A@5) compared to state-of-the-art methods.
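To make the idea of treating services and hosts as distinct node types more concrete, the following is a minimal illustrative sketch of a heterogeneous graph with typed nodes and direction-specific edge types. It is not NexusRCL's implementation: the class `HeteroGraph`, the relation names (`calls`, `runs_on`, `hosts`), and the example entities are all assumptions made for illustration only.

```python
from dataclasses import dataclass

# Hypothetical node kinds; the abstract names services and hosts as node types.
NODE_KINDS = ("service", "host")


@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # "service" or "host"


class HeteroGraph:
    """Toy heterogeneous graph: edges are grouped by (src_kind, relation, dst_kind)."""

    def __init__(self):
        self.nodes = {}
        self.edges = {}  # edge type -> list of (src_name, dst_name)

    def add_node(self, name, kind):
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.nodes[name] = Node(name, kind)

    def add_edge(self, src, relation, dst):
        # The edge type is derived from the endpoint kinds, so a
        # service->host edge and a host->service edge are stored separately.
        edge_type = (self.nodes[src].kind, relation, self.nodes[dst].kind)
        self.edges.setdefault(edge_type, []).append((src, dst))

    def neighbors(self, name, edge_type):
        return [d for s, d in self.edges.get(edge_type, []) if s == name]


def build_example():
    # Entity names below are illustrative assumptions, not from the paper.
    g = HeteroGraph()
    g.add_node("frontend", "service")
    g.add_node("cart", "service")
    g.add_node("node-1", "host")

    g.add_edge("frontend", "calls", "cart")  # same-layer dependency
    for svc in ("frontend", "cart"):
        g.add_edge(svc, "runs_on", "node-1")  # service -> host
        g.add_edge("node-1", "hosts", svc)    # host -> service
    return g


if __name__ == "__main__":
    g = build_example()
    for edge_type, pairs in g.edges.items():
        print(edge_type, pairs)
```

Keeping the two cross-layer directions as separate edge types is one simple way to let a downstream model treat them differently, which is the kind of asymmetry the abstract describes. How NexusRCL actually encodes it is not specified here.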