This paper systematically characterizes the transport-layer limitations of federated learning (FL) systems in resource-constrained edge environments, using a reproducible testbed and chaos engineering tools. The study finds that FL's burst-idle communication pattern clashes with standard TCP connection management, causing training to fail under high latency, heavy packet loss, or high client dropout. By adjusting three TCP connection management parameters, the authors reduce training time under extreme latency and argue that transport-layer awareness is essential for reliable FL deployment at the network edge.
Standard federated learning deployments can fail catastrophically at 5-second one-way latency or above 50% packet loss, revealing a fundamental mismatch between FL's communication patterns and default TCP configurations.
Motivated by the growing proliferation of federated learning (FL) in edge environments, we present the first systematic characterization of transport-layer breaking points in FL systems operating under conditions of highly constrained network and compute resources. Using a reproducible testbed with chaos engineering tools, we evaluate Flower under progressively degraded network conditions representative of resource-constrained deployments in Africa and similar environments. Our empirical investigation reveals a fundamental mismatch between FL's burst-idle communication pattern and standard TCP connection management. We identify precise operational boundaries: FL training catastrophically fails at 5-second one-way latency due to TCP handshake timeouts, above 50% packet loss due to buffer exhaustion, and with 90% client dropout rates. Through systematic analysis of connection patterns during training rounds, we demonstrate that FL's periodic model update bursts, separated by extended local training periods, violate the assumptions underlying default TCP configurations. To validate the significance of these findings, we show that adjusting just three TCP connection management parameters can significantly reduce training time under extreme latency, proving that transport-layer awareness is not merely beneficial but essential for FL deployment at the network edge. Our characterization methodology and findings provide practitioners with concrete thresholds for determining when standard FL deployments will fail and when advanced reliability techniques become necessary.
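The thresholds reported in the abstract can be turned into a simple pre-deployment check. The sketch below is illustrative only: the `NetworkConditions` container and the `failure_reasons` helper are hypothetical names, not part of Flower or of the authors' testbed, and the choice of `>=` versus `>` is an assumption that follows the abstract's wording (failure "at" 5-second one-way latency, "above" 50% packet loss, and "with" 90% client dropout).

```python
from dataclasses import dataclass
from typing import List

# Breaking points as stated in the abstract. The exact comparison
# operators are an assumption based on the abstract's phrasing.
LATENCY_FAIL_S = 5.0          # one-way latency, in seconds
LOSS_FAIL_FRACTION = 0.50     # packet loss, as a fraction of packets
DROPOUT_FAIL_FRACTION = 0.90  # client dropout, as a fraction of clients


@dataclass(frozen=True)
class NetworkConditions:
    """Hypothetical container for measured deployment conditions."""
    one_way_latency_s: float
    packet_loss: float
    client_dropout: float


def failure_reasons(c: NetworkConditions) -> List[str]:
    """Return the reported breaking points that these conditions reach."""
    reasons = []
    if c.one_way_latency_s >= LATENCY_FAIL_S:
        reasons.append("latency: TCP handshake timeouts")
    if c.packet_loss > LOSS_FAIL_FRACTION:
        reasons.append("packet loss: buffer exhaustion")
    if c.client_dropout >= DROPOUT_FAIL_FRACTION:
        reasons.append("client dropout")
    return reasons


if __name__ == "__main__":
    # Example measurements (made up for illustration).
    link = NetworkConditions(one_way_latency_s=6.0,
                             packet_loss=0.10,
                             client_dropout=0.30)
    for reason in failure_reasons(link):
        print("Standard FL deployment likely to fail:", reason)
```

An empty result only means that none of the three reported boundaries is reached; it does not say how much training slows down below them.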