This paper investigates how clock skew affects the causal correctness of observability data in distributed AI inference pipelines. In controlled experiments that introduce clock skew at a single pipeline stage, the authors show that skew as small as 5 ms can produce causality violations in observability data, even though the system remains functionally correct and its throughput is largely unaffected. They further show that the severity of these violations can change over time as relative clock drift between nodes shifts the effective skew, so the problem is dynamic rather than static.
Clock skew as small as 5 ms can break causality in observability data from distributed AI inference systems, even when the system itself remains functionally correct.
Distributed AI inference pipelines rely heavily on timestamp-based observability to understand system behavior. This work demonstrates that even small clock skew between nodes can cause observability to become causally incorrect while the system itself remains functionally correct and performant. We present controlled experiments on a multi-node AI inference pipeline, where clock skew is introduced at a single stage. Results show that no violations are observed under synchronized conditions and up to 3 ms skew, while clear causality violations emerge by 5 ms. Despite this, system throughput and output correctness remain largely unaffected. We further observe that violation behavior is not strictly static. In longer runs, negative span rates may stabilize or decrease over time, indicating that effective skew evolves due to relative clock drift between nodes. Experiments were conducted using Kafka and ZeroMQ transports, with consistent results across both. Aeron is under active exploration but is not yet included in the completed validation set. These findings suggest that observability correctness depends not only on system functionality but also on precise time alignment, and that timing must be treated as a first-class concern in distributed AI systems.
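The mechanism described above can be illustrated with a minimal sketch. It is not the authors' experimental harness: the names `Span`, `simulate_hop`, and `negative_span_rate`, as well as the transit times in `TRANSITS_MS`, are hypothetical and chosen only for illustration. The sketch assumes that a span across one hop is computed as the downstream receive timestamp minus the upstream send timestamp, and that the upstream clock runs ahead by a fixed skew. Under that assumption, a span turns negative as soon as the skew exceeds the real transit time, even though every message is still delivered in order.

```python
from dataclasses import dataclass


@dataclass
class Span:
    send_ts_ms: float  # stamped by the upstream node's (possibly skewed) clock
    recv_ts_ms: float  # stamped by the downstream node's clock

    @property
    def duration_ms(self) -> float:
        # Observed hop duration as a tracing backend would compute it.
        return self.recv_ts_ms - self.send_ts_ms


def simulate_hop(true_send_ms: float, transit_ms: float, upstream_skew_ms: float) -> Span:
    """Record one hop. The upstream clock runs ahead by upstream_skew_ms (assumption)."""
    return Span(
        send_ts_ms=true_send_ms + upstream_skew_ms,
        recv_ts_ms=true_send_ms + transit_ms,
    )


def negative_span_rate(transits_ms: list[float], upstream_skew_ms: float) -> float:
    """Fraction of spans whose observed duration is negative."""
    spans = [
        simulate_hop(i * 10.0, transit, upstream_skew_ms)
        for i, transit in enumerate(transits_ms)
    ]
    negatives = sum(1 for span in spans if span.duration_ms < 0)
    return negatives / len(spans)


# Hypothetical per-message transit times in milliseconds (not measured data).
TRANSITS_MS = [3.5, 4.0, 4.5, 6.0]

for skew_ms in (0.0, 3.0, 5.0):
    rate = negative_span_rate(TRANSITS_MS, skew_ms)
    print(f"skew={skew_ms} ms -> negative span rate={rate}")
```

With these illustrative numbers, no span is negative at 0 ms or 3 ms of skew, while three of the four spans are negative at 5 ms. The delivered messages are identical in all three cases; only the recorded timestamps differ, which is why functional correctness and throughput can stay intact while the trace becomes causally wrong. Modeling drift would amount to letting `upstream_skew_ms` vary with time rather than holding it constant.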