UT Austin | Feb 19, 2026 | arXiv:2602.17565

Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning

Hien Dang, Pratik Patil, Alessandro Rinaldo

AI Summary

This paper provides a theoretical analysis of unconstrained self-distillation (SD) in ridge regression, demonstrating that the optimally mixed student strictly improves upon the ridge teacher for any squared prediction risk at every regularization level where the teacher's risk is nonstationary. The authors derive a closed-form expression for the optimal mixing weight and establish its sign rule, showing that the weight can be negative in over-regularized regimes. They also derive exact deterministic equivalents for the optimal SD risk in the proportional asymptotics regime and propose a consistent one-shot tuning method for estimating the optimal mixing weight.

Key Contribution

Self-distillation isn't just a trick: this paper proves that it strictly improves ridge regression performance wherever the teacher's risk is nonstationary, with negative mixing weights arising in over-regularized regimes, and offers a one-shot tuning method to make it practical.

Abstract

Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions using the same architecture and training data. Although SD has been empirically shown to often improve generalization, its formal guarantees remain limited. We study SD for ridge regression in the unconstrained setting in which the mixing weight $ξ$ may be outside the unit interval. Conditioned on the training data and without any distributional assumptions, we prove that for any squared prediction risk (including out-of-distribution), the optimally mixed student strictly improves upon the ridge teacher for every regularization level $λ > 0$ at which the teacher ridge risk $R(λ)$ is nonstationary (i.e., $R'(λ) \neq 0$). We obtain a closed-form expression for the optimal mixing weight $ξ^\star(λ)$ for any value of $λ$ and show that it obeys the sign rule: $\operatorname{sign}(ξ^\star(λ))=-\operatorname{sign}(R'(λ))$. In particular, $ξ^\star(λ)$ can be negative, which is the case in over-regularized regimes. To quantify the risk improvement due to SD, we derive exact deterministic equivalents for the optimal SD risk in the proportional asymptotics regime (where the sample and feature sizes $n$ and $p$ both diverge but their aspect ratio $p/n$ converges) under general anisotropic covariance and deterministic signals. Our asymptotic analysis extends standard second-order ridge deterministic equivalents to their fourth-order analogs using block linearization, which may be of independent interest. From a practical standpoint, we propose a consistent one-shot tuning method to estimate $ξ^\star$ without grid search, sample splitting, or refitting. Experiments on real-world datasets and pretrained neural network features support our theory and the one-shot tuning method.
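The sign rule and the improvement claim can be checked numerically on simulated data. The following is a minimal sketch, not the authors' code, and it rests on several assumptions that the abstract does not spell out: the student is trained on the labels `(1 - xi) * y + xi * X @ teacher`, it uses the same penalty `lam` as the teacher, the ridge objective is normalized by `1/n`, and the risk is the excess squared prediction risk under a known covariance `Sigma` and known signal `beta_star`. Because it uses the true signal, `oracle_xi` is an oracle quantity for illustration only; it is not the paper's one-shot tuning estimator. All function names are hypothetical.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator with 1/n normalization (an assumed convention)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

def sd_student(X, y, lam, xi):
    """Refit ridge on a mixture of true labels and teacher predictions.

    Assumed mixing convention: (1 - xi) * y + xi * teacher predictions.
    xi is unconstrained and may be negative or larger than one.
    """
    teacher = ridge(X, y, lam)
    mixed = (1.0 - xi) * y + xi * (X @ teacher)
    return ridge(X, mixed, lam)

def excess_risk(beta, beta_star, Sigma):
    """Excess squared prediction risk (beta - beta*)^T Sigma (beta - beta*)."""
    diff = beta - beta_star
    return float(diff @ Sigma @ diff)

def oracle_xi(X, y, lam, beta_star, Sigma):
    """Minimizer over xi of the student's risk, using the true beta_star.

    The student is affine in xi, so the risk is a quadratic in xi and the
    minimizer is a one-dimensional least-squares projection.
    """
    teacher = ridge(X, y, lam)
    direction = sd_student(X, y, lam, 1.0) - teacher
    resid = teacher - beta_star
    return float(-(resid @ Sigma @ direction) / (direction @ Sigma @ direction))

# Simulated isotropic design (illustrative values only).
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta_star = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_star + rng.standard_normal(n)
Sigma = np.eye(p)

def teacher_risk(lam):
    return excess_risk(ridge(X, y, lam), beta_star, Sigma)

results = {}
for lam in (0.01, 5.0):  # one small and one large penalty
    xi = oracle_xi(X, y, lam, beta_star, Sigma)
    r_teacher = teacher_risk(lam)
    r_student = excess_risk(sd_student(X, y, lam, xi), beta_star, Sigma)
    eps = 1e-4 * lam  # central finite difference for R'(lam)
    slope = (teacher_risk(lam + eps) - teacher_risk(lam - eps)) / (2 * eps)
    results[lam] = (xi, r_teacher, r_student, slope)
    print(f"lam={lam}: xi*={xi:.3f}, R'={slope:.3e}, "
          f"teacher={r_teacher:.4f}, student={r_student:.4f}")
```

Under these assumptions the student equals the teacher plus `xi` times a fixed direction, which is why a single projection recovers the best weight and why setting `xi = 0` recovers the teacher, so the optimally mixed student can never be worse. The printed slope and weight should have opposite signs at each penalty, in line with the sign rule stated above.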

Inference & Quantization | Training Efficiency & Optimization

