Nihal Balivada

1 University of Oregon, US; 2 Boston College, US; 3 Microsoft Research, India

Abstract

Federated Learning (FL) enables a distributed client-server architecture where multiple clients collaboratively train a global Machine Learning (ML) model without sharing sensitive local data. However, FL often results in lower accuracy than traditional ML algorithms due to statistical heterogeneity across clients. Prior works attempt to address this by using model updates, such as loss and bias, from client models to select participants that can improve the global model’s accuracy. However, these updates do not accurately represent a client’s heterogeneity, and the selection methods built on them are not deterministic. We mitigate these limitations by introducing Terraform, a novel client selection methodology that uses gradient updates and a deterministic selection algorithm to select heterogeneous clients for retraining. This two-pronged approach allows Terraform to achieve up to 47% higher accuracy over prior works. We further demonstrate its efficiency through comprehensive ablation studies and training time analyses, providing strong justification for the robustness of Terraform.

1 Introduction

In this paper, we present Terraform, a novel client selection methodology for Federated Learning (FL) McMahan et al. (2017) that addresses heterogeneity in client data distribution by iteratively partitioning clients into two clusters, easy and hard, based on a split index. An FL framework enables a distributed client-server architecture Tanenbaum and van Steen (2007) where multiple clients collaboratively train a global Machine Learning (ML) model without sharing sensitive (local) data. Each client receives the global model’s parameters from the server, uses them to train the model on its data, and shares the updated model parameters with the server. The server then aggregates the parameters from the various clients.
This process continues until the global model converges (reaches its highest accuracy). Despite significant seminal research in FL Yang et al. (2019), its practical adoption remains limited because it yields lower accuracy than traditional ML algorithms. This accuracy gap arises because the local data of clients is often drawn from different distributions, causing statistical heterogeneity Chen and Vikalo (2024). In an FL environment, statistical heterogeneity represents the variation in data distributions across clients, caused by variations in the feature or label distributions of clients’ local data. It contributes to the non-iid (non-independent and identically distributed) nature of the data. Statistical heterogeneity is a double-edged sword: while it improves model generalization due to data diversity, it increases training complexity and reduces model accuracy.

For example, modeling air quality requires training an ML model on environmental features, such as weather conditions and the topography of the location of interest Cheng et al. (2018). Several organizations perform such air quality modeling on a global scale that spans multiple countries Galmarini and Trivikrama Rao (2011); Pan et al. (2017). However, globally scaling these models requires data sharing, which is challenging due to proprietary or national security concerns. FL is an ideal candidate for training such a model, as it enables localized data collection and training. Unfortunately, FL inadvertently induces statistical heterogeneity in this example: one training location may be an urban area with high pollutant emissions, while another is in a rural setting. Although this heterogeneity helps the global model generalize, it increases the probability of capturing unwanted patterns, reducing model performance.
Prior works attempt to mitigate the negative impact of statistical heterogeneity by designing client selection methodologies that decide which clients should participate in FL training. Lai et al. (2021) estimate the statistical utility of each client based on multiple factors like model loss, device speed, and bandwidth, and then randomly select a subset of high-utility clients for retraining. Jee Cho et al. (2022) select the $m$ clients with the highest local training loss for retraining, where the value of $m$ is determined by the system. In contrast, Chen and Vikalo (2024) cluster clients into groups using their final layer’s biases and retrain clients in those clusters that have more uniform data distributions. Although these prior works present valuable designs, they are unable to attain higher accuracy for the following two reasons: (i) they assume client updates are either losses or final-layer biases, which do not reveal all details about a client’s data distribution; and (ii) their algorithms for selecting clients to be retrained are not deterministic.

In contrast, our methodology, Terraform, aims to eliminate randomness by deterministically selecting which clients need retraining and by utilizing updates that precisely account for statistical heterogeneity. Additionally, Terraform ensures that it does not introduce any new computational or communication costs.

So how does Terraform meet these goals? Terraform’s novel client selection methodology expects each client to send the final-layer gradient updates returned by its model after training, along with its dataset size; together these capture the statistical utility of a client’s data. Terraform uses the gradient updates to sort the clients and the dataset sizes to determine the Inter-Quartile Range, which enables us to find an optimal split index to partition clients into two clusters: easy and hard.
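As a rough illustration of the iterative easy/hard partitioning outlined above, the following Python sketch repeatedly sorts clients by a scalar summary of their final-layer gradient update and keeps the upper part as the hard cluster. Several pieces are assumptions rather than the paper's method: using a per-client gradient norm as the sorting key, treating larger norms as harder, and the hypothetical `median_split` helper, which merely stands in for the IQR-based split index that is not spelled out here. The names `hierarchical_select`, `grad_norms`, and `sizes` are likewise illustrative.

```python
from typing import Callable, Dict, List

SplitFn = Callable[[List[str], Dict[str, int]], int]


def median_split(ordered_ids: List[str], sizes: Dict[str, int]) -> int:
    """Hypothetical placeholder split rule: cut the sorted list in half.

    The paper derives its split index from an Inter-Quartile Range computed
    with the dataset sizes; that rule is NOT reproduced here, so `sizes`
    is accepted but unused.
    """
    return len(ordered_ids) // 2


def hierarchical_select(
    grad_norms: Dict[str, float],
    sizes: Dict[str, int],
    threshold: int,
    split_fn: SplitFn = median_split,
) -> List[List[str]]:
    """Return the hard cluster chosen at each iteration.

    Assumption: a larger gradient norm marks a harder client.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    active = list(grad_norms)
    history: List[List[str]] = []
    while len(active) > threshold:
        # Deterministic ordering: sort by gradient norm, ascending.
        ordered = sorted(active, key=lambda cid: grad_norms[cid])
        idx = split_fn(ordered, sizes)
        # Clamp so both clusters are non-empty and the loop terminates.
        idx = max(1, min(idx, len(ordered) - 1))
        hard = ordered[idx:]
        history.append(hard)
        # A real system would retrain `hard` here and refresh grad_norms.
        active = hard
    return history


norms = {f"c{i}": float(i) for i in range(8)}
sizes = {cid: 100 for cid in norms}
rounds = hierarchical_select(norms, sizes, threshold=2)
print(rounds)
```

With eight clients and a threshold of two, this sketch narrows the hard cluster from four clients to two and then stops. Because the ordering and the split rule contain no random draws, repeated runs on the same inputs select the same clients.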
Terraform selects and retrains the clients in the hard cluster, iterating over the above steps until only a threshold number of clients remain.

To show that Terraform performs well in practice, we follow prior works and implement it in two popular FL algorithms, FedAvg and FedProx, and extensively compare Terraform against five state-of-the-art client selection methodologies. Our results illustrate that Terraform outperforms all of these client selection methodologies on five popular FL datasets (FEMNIST, CIFAR10, CIFAR100, FMNIST, and Tiny ImageNet) and increases the accuracy of FedAvg and FedProx by up to 47% and 41%, respectively. Additionally, we validate the robustness of Terraform’s design through rigorous ablations and training time experiments.

2 The Case for Data Heterogeneity

We motivate the need to tackle client data heterogeneity through the following example. US communities regularly face disproportionate threats of disasters like wildfires and fire incidents Federal Emergency Management Agency (1997); Rivers and University (2022); National Interagency Coordination Center (NICC) (2024). In response, the US government has accelerated the installation of IoT wildfire sensors like PTZ Camera University of Nevada et al. (2025), promoted the use of smart home sensors like Google Nest LLC (2025), and funded the setup of hazard workflows like AlertWildfire Smith et al. (2016). These hazard workflows collect observations with sensors, disseminate those observations to a remote cloud for processing, infer the impacts of those hazards in the cloud, and notify various stakeholders. However, these hazard workflows face two major challenges: (i) they collect user data through sensors and upload it to the cloud, which violates user privacy of Homeland Security (2013); Raffeg and others (2025); Wall and Space.com (2025); U.S. Department of Homeland Security (2019), and (ii) they consume massive network bandwidth due to sensor-cloud communication.
This presents an elegant opportunity to leverage FL for disaster response Jin and Du (2024); the sensors can play the role of clients and participate in FL training. Notice that the two types of sensors we consider here, PTZ cameras and Google Nest, have distinct deployment ranges and alert frequencies. PTZ cameras are sparse, remote (often in forests), and generate few but highly reliable wildfire alerts. Nest sensors, on the other hand, are abundant in homes and produce frequent non-wildfire alerts, triggered by smoke from wood stoves, prescribed burns, vehicles, etc., with rare true positives. Since most training data comes from home sensors, any FL system that does not account for data heterogeneity will overlook the true positive alerts from the cameras, causing the global model to have low accuracy and to miss early wildfire signals. Terraform aims to train on such heterogeneous data while yielding high accuracy for all events.

3 Background

The seminal FedAvg algorithm McMahan et al. (2017) coined the term Federated Learning (FL), where in a distributed client-server architecture, multiple clients collaboratively train a global ML model without sharing local data with the server. We formalize its design as follows. Consider a standard FL system where a server coordinates $K$ participating clients, indexed by $k \in \{1, 2, \ldots, K\}$. The $k$-th client has a local classification dataset (used for training), $D_{\text{train}}^{k} = \{(x^{(k)}_{j},\, y^{(k)}_{j})\}$, where $x^{(k)}_{j} \in \mathbb{R}^{m}$ is the feature vector and $y^{(k)}_{j} \in \{1, \ldots, C\}$ is the label, drawn from $C$ classes. Each client computes a model update by minimizing the loss function $L(\theta;\, x^{(k)}_{j},\, y^{(k)}_{j})$, where $\theta$ denotes the model parameters.
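The server-side step of this setup can be illustrated with a small Python sketch. In FedAvg, the server averages the client parameters with weights proportional to each client's dataset size $|D_{\text{train}}^{k}|$. The function name `fedavg_aggregate` and the flat-list representation of the parameters $\theta$ are assumptions made for illustration; real systems typically aggregate tensors layer by layer.

```python
from typing import List


def fedavg_aggregate(client_params: List[List[float]],
                     dataset_sizes: List[int]) -> List[float]:
    """Dataset-size-weighted average of client parameter vectors."""
    if not client_params or len(client_params) != len(dataset_sizes):
        raise ValueError("need one dataset size per client")
    total = sum(dataset_sizes)
    if total <= 0:
        raise ValueError("total dataset size must be positive")
    dim = len(client_params[0])
    aggregated = [0.0] * dim
    for params, n_k in zip(client_params, dataset_sizes):
        if len(params) != dim:
            raise ValueError("all clients must share one parameter shape")
        weight = n_k / total  # weight proportional to |D_train^k|
        for i, value in enumerate(params):
            aggregated[i] += weight * value
    return aggregated


# Two clients: the second holds three times as much data as the first,
# so it contributes three quarters of the aggregated value.
global_params = fedavg_aggregate([[1.0, 2.0], [3.0, 4.0]], [1, 3])
print(global_params)
```

A client with more local data pulls the global parameters further towards its own update, which is one reason a skewed client population can dominate training.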
These updates are sent to the server, which performs weighted averaging over the received updates, where the weights are proportional to each client’s dataset size, $|D_{\text{train}}^{k}|$; this yields the updated global model parameters. The server then shares the updated model parameters with the clients, and this process continues either for a pre-defined number of rounds or until the model’s accuracy saturates. The accuracy of the global model is evaluated on each client’s test data, $D_{\text{test}}^{k}$.

Techniques addressing statistical heterogeneity

Prior works have explored a variety of approaches to handle statistical heterogeneity in federated learning (FL). Regularization-based techniques discourage client model drift from the global solution by modifying the local objective. For example, FedProx Li et al. (2020) adds a proximal term for more stable convergence, while FedDyn Acar et al. (2021) aligns the global and client optima via a dynamic regularizer, and FedDC Gao et al. (2022) uses an auxiliary variable that tracks local drift to decouple and correct gradient and parameter drift. Variance reduction and drift correction algorithms like SCAFFOLD Karimireddy et al. (2020) remove client drift and achieve convergence rates comparable to centralized SGD using control variates, while FedNova Wang et al. (2020b) eliminates the objective inconsistency caused by heterogeneity in the number of local updates by normalizing the client updates. Other algorithms like VRL-SGD Liang et al. (2020), FedVRA Wang et al. (2023) and FedRed Jiang et al. (2024) reduce variance via gradient tracking or dual variables. Adaptive optimization and dynamic aggregation methods improve performance under statistical heterogeneity by extending techniques from adaptive optimizers like Adagrad, Adam and Yogi to the federated setting Reddi et al. (2021). For example, FedAW Tang (2024), FedAWA Shi et al.
(2025) and FedADp Wu and Wang (2021) make use of client contributions to adaptively adjust aggregation weights. Layer-wise and architecture-aware aggregation methods such as FedMA Wang et al. (2020a) use neuron matching and averaging to improve performance on deep CNN and LSTM architectures. Contrastive approaches like MOON Li et al. (2021a) align global and local representations using contrastive losses. Personalization / multi-mode techniques tackle statistical heterogeneity via knowledge-preserving cross-client transfer and create models tailored to each client. These techniques include locally regularized model personalization (Ditto Li et al. (2021b) with local copies and pFedMe T Dinh et al. (2020) with Moreau envelopes), meta learning for fast adaptation (PeFLL Scott et al. (2024), Per-FedAvg / MOCHA Fallah et al. (2020) and federated MAML Jiang et al. (2019)), representation decoupling (LG-FedAvg Liang et al. (2019), FedPer Arivazhagan et al. (2019), FedRoD Chen and Chao (2022) and FedRep Collins et al. (2021)), parameter/feature alignment (FedAS Yang et al. (2024) and FedPAC Xu et al. (2023)), bilevel optimization methods (pFedHN Shamsian et al. (2021) and FedBabu Oh et al. (2022)), and partitioning the hypothesis space into compatible submodels (FLOCO Grinwald et al. (2024) with solution-simplex methods). Clustered and model-mixture methods cluster clients based on similarity (e.g., FedClust Islam et al. (2024), CFL Sattler et al. (2020), and IFCA Ghosh et al. (2020)) and train a model for each cluster. Data augmentation and distillation techniques like FedDF Lin et al. (2020), FedMD Li and Wang (2019), FedMix Yoon et al. (2021), FedFed Yang et al. (2023), FedProto Tan et al. (2021) and FADA Peng et al. (2020) reduce statistical heterogeneity in the data by using proxy data or shared statistics.
Finally, client selection and sampling techniques balance efficiency and representativeness by systematically picking the clients participating in each round of federated training. These techniques prioritize clients based on a quantification of their utility towards the expected progress per round: statistical (PoW-D Jee Cho et al. (2022)), system-based (FedCS Nishio and Yonetani (2019)), or both (Oort Lai et al. (2021)). They also consider the diversity or correlation between client updates (DivFL Balakrishnan et al. (2022) and FedCor Tang et al. (2022)) and the population structure captured by clustered sampling Fraboni et al. (2021), including advanced hierarchical variants like HiCS-FL Chen and Vikalo (2024), which additionally uses output-layer bias update information for heterogeneity-aware client selection.

4 Related Work

Client Selection Methodologies. The FedAvg algorithm selects participating clients randomly in each training round. However, this random selection can lead to client drift, a phenomenon in which the global model either converges slowly or becomes trapped in local optima, often due to the repeated selection of highly heterogeneous clients in each round Karimireddy et al. (2020). FedProx Li et al. (2020) extends FedAvg and addresses client drift by adding a regularization term that penalizes large deviations from the global model. This keeps the local updates more aligned with the global optimum and improves the stability of model training. Motivated by client drift, a broad body of federated learning research focuses on designing principled client selection methodologies that can recognize client heterogeneity and concentrate retraining efforts on the most impactful clients. Lai et al. (2021) introduce Oort, which estimates the statistical utility of each client in FL training based on how effectively its updates improve the global model.
This utility is measured as a combination of model loss, device speed, and bandwidth, and is used to assign each client a selection probability. Oort sorts the clients based on these probabilities and randomly selects a subset of high-probability clients. Additionally, it randomly selects a few low-probability clients to reduce training bias. Jee Cho et al. (2022) introduce power of choice (PoC), which selects clients with high local training loss. The server first randomly samples a pool of candidate clients and queries their local training loss. Next, it selects the $m$ clients with the highest losses, where the value of $m$ is determined by the system. HiCS-FL Chen and Vikalo (2024) uses bias values from the final layer of each client’s model to determine the complexity of that client’s dataset. It uses these biases to cluster clients into groups, assigns them a sampling probability, and trains clients in clusters with high sampling probability. Specifically, HiCS-FL targets clients with less complexity, which is the opposite of our goal.

Note that while PoC and Oort use loss as client updates, HiCS-FL uses biases. Further, these protocols do not collect updates in each round, so the updates become outdated and fail to give a holistic representation of the client’s local model. In contrast, Terraform requires clients to send final-layer gradient updates, which combine both weights and biases, to compute a client’s statistical heterogeneity. These gradient updates are a better indicator of the client’s distribution due to their proximity to the output layer, and they capture a more direct signal than loss of how well a client’s model has learned Zeiler and Fergus (2014).

Hierarchical Splitting. The hierarchical selection approach in Terraform is inspired by the CART (Classification and Regression Trees) model Breiman et al. (1984), which recursively partitions data at decision nodes into left and right child nodes using a partitioning criterion.
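To make the CART-style partitioning concrete, the Python sketch below finds a single split of one-dimensional values that minimizes the summed squared error of the two child nodes, a common regression-tree criterion. It illustrates the general CART idea only and is not Terraform's split rule; the function name `best_split_index` and the choice of squared error as the criterion are assumptions.

```python
from typing import List, Sequence


def _sse(segment: Sequence[float]) -> float:
    """Sum of squared deviations from the segment mean."""
    mean = sum(segment) / len(segment)
    return sum((v - mean) ** 2 for v in segment)


def best_split_index(values: Sequence[float]) -> int:
    """Return i such that sorted(values)[:i] and sorted(values)[i:]
    minimize the total within-child squared error."""
    xs: List[float] = sorted(values)
    if len(xs) < 2:
        raise ValueError("need at least two values to split")
    best_i, best_cost = 1, float("inf")
    for i in range(1, len(xs)):  # both children stay non-empty
        cost = _sse(xs[:i]) + _sse(xs[i:])
        if cost < best_cost:
            best_i, best_cost = i, cost
    return best_i


# Two well-separated groups: three small values and two large ones.
split = best_split_index([1.0, 1.1, 0.9, 5.0, 5.2])
print(split)
```

A full tree applies the same search recursively to each child node until a stopping condition is met, such as a minimum node size.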
Hierarchical sampling has also been applied in the meta-learning setting for spatio-temporal domains Xie et al. (2021); Liu et al. (2023), where prior works focus on applying hierarchical splitting in a centralized setup. In contrast, Terraform introduces hierarchical splitting in a federated (i.e., distributed) learning setting, where only gradients are exchanged between clients and the server, minimizing communication overhead and preserving privacy. Furthermore, Terraform explicitly accounts for local dataset heterogeneity during client-side training, improving model performance in non-iid scenarios.

Figure 1: Terraform Framework.

5 Terraform Overview

Terraform is a novel client selection methodology which, when employed by an existing federated learning (FL) algorithm, helps it select the statistically heterogeneous clients that negatively impact the accuracy of the FL algorithm. We use Figure 1 to illustrate the design of Terraform. Like any other methodology, Terraform trains the clients over multiple rounds. However, each round of Terraform performs hierarchical client selection through a series of iterations. A single iteration of Terraform includes the following steps. The server starts with the global model

Meta AI (FAIR)

Frequent co-authors

Shrey Gupta (1), Shashank Shreedhar Bhatt (1), Suyash Gupta (1)

Feb 24, 2026

Heterogeneity-Aware Client Selection Methodology For Efficient Federated Learning

A deterministic client selection method leveraging gradient updates can boost federated learning accuracy by nearly 50% in heterogeneous environments.

Nihal Balivada, Shrey Gupta, Shashank Shreedhar Bhatt +1
