The paper introduces Fed-FSTQ, a federated learning framework for fine-tuning LLMs on edge devices that addresses communication bottlenecks by selectively quantizing and transmitting tokens based on their Fisher information. A lightweight Fisher proxy estimates token sensitivity, enabling importance-aware token selection and non-uniform mixed-precision quantization, which significantly reduces uplink traffic while preserving model accuracy. Experiments on multilingual and medical QA tasks show that Fed-FSTQ achieves up to a 46x reduction in uplink traffic and a 52% improvement in wall-clock time-to-accuracy compared to a standard LoRA baseline.
Stop wasting bandwidth on irrelevant tokens: Fed-FSTQ uses Fisher information to selectively quantize and transmit only the most important tokens, slashing communication costs in federated LLM fine-tuning by up to 46x.
Federated fine-tuning provides a practical route to adapting large language models (LLMs) on edge devices without centralizing private data, yet in mobile deployments training wall-clock time is often bottlenecked by straggler-limited uplink communication under heterogeneous bandwidth and intermittent participation. Although parameter-efficient fine-tuning (PEFT) reduces the number of trainable parameters, per-round payloads remain prohibitive in non-IID regimes, where uniform compression can discard rare but task-critical signals. We propose Fed-FSTQ, a Fisher-guided token quantization primitive for communication-efficient federated LLM fine-tuning. Fed-FSTQ employs a lightweight Fisher proxy to estimate token sensitivity, coupling importance-aware token selection with non-uniform mixed-precision quantization to allocate higher fidelity to informative evidence while suppressing redundant transmission. The method is model-agnostic, serves as a drop-in module for standard federated PEFT pipelines (e.g., LoRA) without modifying the server aggregation rule, and supports bandwidth-heterogeneous clients via compact sparse message packing. Experiments on multilingual QA and medical QA under non-IID partitions show that Fed-FSTQ reduces the cumulative uplink traffic required to reach a fixed quality threshold by 46x relative to a standard LoRA baseline, and improves end-to-end wall-clock time-to-accuracy by 52%. Furthermore, enabling Fisher-guided token reduction at inference yields up to a 1.55x end-to-end speedup on NVIDIA Jetson-class edge devices, demonstrating deployability under tight resource constraints.
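The abstract describes the client-side pipeline (Fisher-proxy scoring, importance-aware token selection, non-uniform mixed-precision quantization, sparse packing) but not its exact form. Below is a minimal, illustrative Python sketch of one plausible instantiation, assuming the Fisher proxy is a per-token mean-squared-gradient surrogate; the function names (`fisher_proxy`, `pack_sparse_update`), the keep ratio, and the 8-/4-bit split are hypothetical choices, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): Fisher-proxy token selection with
# non-uniform mixed-precision quantization and sparse payload packing.
# Assumption: per-token gradients of the local loss are available, and the
# Fisher proxy is the mean squared gradient per token (a diagonal-Fisher-style
# surrogate). All names and hyperparameters here are illustrative.
import numpy as np

def fisher_proxy(token_grads: np.ndarray) -> np.ndarray:
    """Per-token sensitivity score: mean squared gradient over the hidden dim."""
    return np.mean(token_grads ** 2, axis=-1)

def quantize_uniform(x: np.ndarray, bits: int):
    """Symmetric uniform quantization to `bits`; returns integer codes and a scale."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / levels if np.any(x) else 1.0
    return np.round(x / scale).astype(np.int32), float(scale)

def pack_sparse_update(token_states: np.ndarray,
                       token_grads: np.ndarray,
                       keep_ratio: float = 0.25,
                       hi_bits: int = 8,
                       lo_bits: int = 4):
    """Keep the most Fisher-sensitive tokens; quantize the top half of the kept
    set at `hi_bits` and the remainder at `lo_bits` (non-uniform precision)."""
    scores = fisher_proxy(token_grads)
    n_keep = max(1, int(keep_ratio * len(scores)))
    kept = np.argsort(scores)[::-1][:n_keep]      # most sensitive tokens first
    hi, lo = kept[: n_keep // 2], kept[n_keep // 2:]
    message = []
    for idx_group, bits in ((hi, hi_bits), (lo, lo_bits)):
        for i in idx_group:
            codes, scale = quantize_uniform(token_states[i], bits)
            message.append({"token_index": int(i), "bits": bits,
                            "scale": scale, "codes": codes})
    return message  # compact sparse payload: only kept tokens, mixed precision

# Toy usage with synthetic data standing in for real per-token states/gradients.
rng = np.random.default_rng(0)
states = rng.normal(size=(128, 64))                               # 128 tokens, hidden 64
grads = rng.normal(size=(128, 64)) * np.linspace(0.1, 2.0, 128)[:, None]
payload = pack_sparse_update(states, grads)
print(len(payload), "tokens transmitted out of", states.shape[0])
```

Because only token indices, scales, and low-bit codes are transmitted, the uplink payload shrinks with both the keep ratio and the bit-widths, which is consistent with the communication savings the abstract reports; the actual selection rule, bit allocation, and server-side dequantization in Fed-FSTQ may differ.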