May 25, 2026 · arXiv:2605.25645

Fine-Tuning and Serving Gemma 4 31B on Google Cloud TPU: A Technical Comparison with GPU Baselines

Jatin Kishnani, Mayank Goel, Amit Singh, Pulkit Agrawal, Sairanjan Mishra

AI Summary

This paper demonstrates fine-tuning and serving Google's Gemma 4 31B model on TPUs, bridging a gap in open tooling for TPU-based LLM adaptation. The authors ported a GPU-native PyTorch/HuggingFace training pipeline to JAX/Tunix/Qwix, detailing necessary code-level adaptations for mesh configuration, sharding, and checkpointing. Empirical results show that TPU v5p-8 training is 1.61x faster and 2.12x cheaper than a 2xH100 GPU baseline, with comparable inference throughput and 2x lower time-to-first-token on TPU v6e-8.

Key Contribution

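TPUs aren't just for Google anymore: LoRA fine-tuning of Gemma 4 31B on a TPU v5p-8 is 1.61x faster and 2.12x cheaper than on a 2xH100 GPU baseline, and serving on a TPU v6e-8 roughly halves time-to-first-token (235 ms vs. 475 ms) at comparable throughput.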

Abstract

We present the first end-to-end demonstration of fine-tuning and serving Google's Gemma 4 31B model on TPU hardware, providing an empirical comparison of TPU and GPU platforms for large language model adaptation. Using LoRA on a Google TPU v5p-8 for training and TPU v6e-8 (Trillium) for inference, we document the full set of code-level adaptations required to port a GPU-native training recipe, built on PyTorch, HuggingFace TRL, and FSDP, to the JAX + Tunix/Qwix stack. These adaptations span mesh configuration, LoRA module naming conventions, sharding annotation corrections, gradient checkpointing, data pipeline restructuring, and a custom Orbax-to-safetensors checkpoint merging procedure. For inference, we detail the vLLM-TPU Docker setup necessary to serve Gemma 4 on v6e-8 and characterize the resulting latency and throughput profile. Compared with a 2xH100 GPU baseline under identical hyperparameters, TPU training completes 1.61x faster at 2.12x lower cost. Inference throughput is within 3% across platforms, while TPU achieves 2x lower time-to-first-token (235 ms vs. 475 ms). Taken together, the TPU configuration is 1.82x cheaper for a representative train-plus-serve workload. Our work closes a critical gap in the open tooling ecosystem and provides practitioners with a reproducible, production-ready recipe for Gemma 4 deployment on TPU infrastructure.
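The headline ratios in the abstract can be related with a little arithmetic. The sketch below is not from the paper: it assumes training cost on each platform is simply hourly price times wall-clock time, so that the cost ratio equals the hourly price ratio times the speedup. Under that assumption it backs out the hourly price ratio implied by the reported 1.61x speedup and 2.12x cost advantage, and checks the "2x" time-to-first-token claim against the quoted latencies. The function names are illustrative only.

```python
# Figures quoted in the abstract.
TRAIN_SPEEDUP = 1.61     # GPU wall-clock time / TPU wall-clock time
TRAIN_COST_RATIO = 2.12  # GPU training cost / TPU training cost
TTFT_TPU_MS = 235.0      # time-to-first-token on TPU v6e-8
TTFT_GPU_MS = 475.0      # time-to-first-token on the GPU baseline


def implied_hourly_price_ratio(speedup: float, cost_ratio: float) -> float:
    """Return (GPU $/hour) / (TPU $/hour) implied by the two ratios.

    Assumption (not stated in the source): cost = hourly price * hours,
    hence cost_ratio = price_ratio * speedup.
    """
    return cost_ratio / speedup


def latency_ratio(gpu_ms: float, tpu_ms: float) -> float:
    """Return how many times lower the TPU latency is than the GPU latency."""
    return gpu_ms / tpu_ms


price_ratio = implied_hourly_price_ratio(TRAIN_SPEEDUP, TRAIN_COST_RATIO)
ttft_ratio = latency_ratio(TTFT_GPU_MS, TTFT_TPU_MS)

print(f"implied GPU/TPU hourly price ratio: {price_ratio:.2f}")  # 1.32
print(f"TTFT ratio (GPU/TPU): {ttft_ratio:.2f}")                 # 2.02
```

Read this way, roughly 1.61x of the 2.12x training cost advantage would come from shorter run time and the remaining factor of about 1.32x from a lower hourly price, provided the simple cost model holds for the authors' accounting.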

Distributed Systems & Hardware · Inference & Quantization · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
