This paper introduces a graph-based anomaly detection system for microservice architectures, specifically within Prime Video, using a Graph Convolutional Network-based Graph Autoencoder (GCN-GAE). The system learns structural representations of service dependencies from directed, weighted graphs at minute-level resolution, comparing embeddings from load tests and live events to identify under-represented services. Results show high precision (96%) in detecting anomalies related to documented incidents, although recall (58%) is limited, highlighting the system's potential for early incident detection and the need for improved anomaly injection frameworks.
Prime Video's new anomaly detection system spots real incident-related services missed by traditional load testing, showing that synthetic traffic does not always reproduce live event behavior.
Prime Video regularly conducts load tests to simulate the viewer traffic spikes seen during live events such as Thursday Night Football as well as video-on-demand (VOD) events such as Rings of Power. While these stress tests validate system capacity, they can sometimes miss service behaviors unique to real event traffic. We present a graph-based anomaly detection system that identifies under-represented services using unsupervised node-level graph embeddings. Built on a GCN-GAE, our approach learns structural representations from directed, weighted service graphs at minute-level resolution and flags anomalies based on cosine similarity between load test and event embeddings. The system identifies services tied to documented incidents and demonstrates early detection capability. We also introduce a preliminary synthetic anomaly injection framework for controlled evaluation, which shows promising precision (96%) and a low false positive rate (0.08%), though recall (58%) remains limited under conservative propagation assumptions. This framework demonstrates practical utility within Prime Video while also surfacing methodological lessons and future directions, providing a foundation for broader application across microservice ecosystems.
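The flagging step described in the abstract, comparing each service's load test embedding with its event embedding by cosine similarity, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the service names, the two-dimensional vectors, the `flag_under_represented` helper, and the `threshold=0.8` cutoff are hypothetical and do not come from the paper, and the GCN-GAE that would produce the real embeddings is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_under_represented(load_test_emb, event_emb, threshold=0.8):
    """Return (service, similarity) pairs whose similarity falls below threshold.

    The threshold is a hypothetical value chosen for this sketch.
    Services missing from the load test embeddings are skipped here.
    """
    flagged = []
    for service, event_vec in event_emb.items():
        load_vec = load_test_emb.get(service)
        if load_vec is None:
            continue
        sim = cosine_similarity(load_vec, event_vec)
        if sim < threshold:
            flagged.append((service, sim))
    # Least similar services first.
    return sorted(flagged, key=lambda pair: pair[1])


# Toy embeddings (hypothetical values, not taken from the paper).
load_test_emb = {"playback": [1.0, 0.0], "auth": [0.6, 0.8], "catalog": [0.0, 1.0]}
event_emb = {"playback": [0.9, 0.1], "auth": [0.8, -0.6], "catalog": [0.0, 2.0]}

flagged = flag_under_represented(load_test_emb, event_emb)
for service, sim in flagged:
    print(f"{service}: similarity {sim:.2f}")
```

In this toy data only `auth` is flagged, because its event embedding is orthogonal to its load test embedding. `catalog` is not flagged even though its vector doubled in length, since cosine similarity compares direction and ignores magnitude.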