May 21, 2026 · arXiv:2605.22579

Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion

Meimingwei Li, Yuanhao Ding, Esteban Garces Arias, Christian Heumann

AI Summary

The paper investigates the "hyperfitting" phenomenon in LLMs, where fine-tuning to near-zero training loss improves open-ended generation, and demonstrates it is distinct from simple temperature scaling or static vocabulary reweighting. Through entropy-matched control experiments and ablation studies, the authors show hyperfitting relies on a dynamic, context-dependent rank reordering mechanism localized to a "Terminal Expansion" in the final transformer block. They introduce Late-Stage LoRA, a targeted fine-tuning strategy updating only the final 5 layers, achieving robust generation with minimal parameter updates.

Key Contribution

Hyperfitting's surprising generation improvements are not just temperature scaling: they stem from a "Terminal Expansion" in the final transformer block that dynamically reorders token ranks.

Abstract

Recent work has identified a counterintuitive phenomenon termed "Hyperfitting", where fine-tuning Large Language Models (LLMs) to near-zero training loss on small datasets surprisingly enhances open-ended generation quality and mitigates repetition in greedy decoding. While effective, the underlying mechanism remains poorly understood, with the extremely low-entropy output distributions suggesting a potential equivalence to simple temperature scaling. In this work, we demonstrate that this phenomenon is fundamentally distinct from distribution sharpening; entropy-matched control experiments reveal that temperature scaling fails to replicate the diversity gains of hyperfitting. Furthermore, we falsify the hypothesis of static vocabulary reweighting, showing through ablation studies that hyperfitting relies on a dynamic, context-dependent rank reordering mechanism. Layer-wise analysis localizes this effect to a "Terminal Expansion" in the final transformer block, where a substantial geometric expansion of the feature space (ΔDim ≈ +80.8) facilitates the promotion of deep-tail tokens. Additionally, we introduce Late-Stage LoRA, a targeted fine-tuning strategy that updates only the final 5 layers, yielding robust generation with minimal parameter updates.
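To make the entropy-matched control concrete, here is a minimal sketch of how one might pick a temperature whose softmax entropy equals a given target. It is not the authors' code: the function names, the toy logits, and the target entropy are all illustrative assumptions. The sketch also checks a basic property of temperature scaling: dividing logits by a positive temperature never changes their ordering, so it leaves the greedy (top-1) choice untouched and cannot produce the rank reordering the abstract attributes to hyperfitting.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def match_temperature(logits, target_entropy, lo=1e-3, hi=100.0, iters=100):
    """Bisection (in log space) for T with entropy(softmax(logits / T)) ~= target.

    Relies on softmax entropy being non-decreasing in T; the target must lie
    between the entropies obtained at `lo` and `hi`.
    """
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint of the bracket
        if entropy(softmax(logits, mid)) > target_entropy:
            hi = mid  # too flat: lower the temperature
        else:
            lo = mid  # too sharp: raise the temperature
    return math.sqrt(lo * hi)

def ranking(values):
    """Indices sorted from highest to lowest value."""
    return sorted(range(len(values)), key=lambda i: -values[i])

# Toy next-token logits (hypothetical, not taken from the paper).
logits = [4.0, 2.5, 2.0, 0.5, -1.0]
target = 0.1  # assumed low-entropy target, standing in for a hyperfitted model

T = match_temperature(logits, target)
matched_probs = softmax(logits, T)

print("matched temperature:", T)
print("entropy at matched temperature:", entropy(matched_probs))
print("ranking unchanged:", ranking(matched_probs) == ranking(logits))
```

In a real control experiment the target entropy would presumably be measured from the hyperfitted model's output distributions and matched per context; the single fixed target here is only for illustration.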

Architecture Design (Transformers, SSMs, MoE) · Natural Language Processing · Scaling Laws & Emergent Abilities

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
