This paper develops an analytical theory of pretraining and fine-tuning in diagonal linear networks to explain how initialization affects feature reuse and refinement. The authors derive exact expressions for the generalization error as a function of initialization parameters and task statistics, and identify four distinct fine-tuning regimes. The key finding is that a smaller initialization scale in earlier layers enables both feature reuse and refinement, which improves generalization on fine-tuning tasks that rely on a subset of the pretraining features. The same dependence on initialization is observed empirically in nonlinear networks trained on CIFAR-100.
Smaller initialization scales in early layers unlock superior fine-tuning generalization by enabling both feature reuse and refinement, challenging the intuition that larger initializations are always better.
Pretraining and fine-tuning are central stages in modern machine learning systems. In practice, feature learning plays an important role across both stages: deep neural networks learn a broad range of useful features during pretraining and further refine those features during fine-tuning. However, an end-to-end theoretical understanding of how choices of initialization impact the ability to reuse and refine features during fine-tuning has remained elusive. Here we develop an analytical theory of the pretraining-fine-tuning pipeline in diagonal linear networks, deriving exact expressions for the generalization error as a function of initialization parameters and task statistics. We find that different initialization choices place the network into four distinct fine-tuning regimes that are distinguished by their ability to support feature learning and reuse, and therefore by the task statistics for which they are beneficial. In particular, a smaller initialization scale in earlier layers enables the network to both reuse and refine its features, leading to superior generalization on fine-tuning tasks that rely on a subset of pretraining features. We demonstrate empirically that the same initialization parameters impact generalization in nonlinear networks trained on CIFAR-100. Overall, our results demonstrate analytically how data and network initialization interact to shape fine-tuning generalization, highlighting an important role for the relative scale of initialization across different layers in enabling continued feature learning during fine-tuning.
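To make the setup concrete, the following is a minimal sketch of a pretraining and fine-tuning pipeline in a two-layer diagonal linear network, where the effective regression weights are the elementwise product `w1 * w2`. It is not the paper's method or code. The parameterization, the Gaussian data, the dimensions, the learning rate, the choice to fine-tune by continuing gradient descent from the pretrained weights, and the three initialization settings are all assumptions chosen for illustration. The fine-tuning target uses a subset of the pretraining features, as in the setting the abstract describes.

```python
import numpy as np

def train(X, y, w1, w2, lr=0.05, steps=4000):
    """Full-batch gradient descent on 0.5 * MSE for f(x) = x @ (w1 * w2)."""
    w1, w2 = w1.copy(), w2.copy()
    for _ in range(steps):
        # Gradient of the loss with respect to the effective weights beta = w1 * w2.
        grad_beta = X.T @ (X @ (w1 * w2) - y) / len(y)
        # Chain rule through the elementwise product; both updates use the old weights.
        w1, w2 = w1 - lr * grad_beta * w2, w2 - lr * grad_beta * w1
    return w1, w2

rng = np.random.default_rng(0)
d, n_pre, n_ft = 20, 200, 10  # hypothetical sizes: ample pretraining data, scarce fine-tuning data

# Assumed task structure: the fine-tuning task uses a subset of the pretraining features.
beta_pre = np.zeros(d)
beta_pre[:6] = 1.0
beta_ft = np.zeros(d)
beta_ft[:3] = 1.0

X_pre = rng.standard_normal((n_pre, d))
y_pre = X_pre @ beta_pre
X_ft = rng.standard_normal((n_ft, d))
y_ft = X_ft @ beta_ft

# (first-layer scale, second-layer scale); illustrative values, not taken from the paper.
configs = {
    "small first layer": (0.01, 1.0),
    "balanced": (0.1, 0.1),
    "large first layer": (1.0, 0.01),
}

results = {}
for name, (s1, s2) in configs.items():
    # Pretrain from a constant initialization with the given layer scales.
    w1, w2 = train(X_pre, y_pre, np.full(d, s1), np.full(d, s2))
    pre_err = np.max(np.abs(w1 * w2 - beta_pre))
    # Fine-tune by continuing gradient descent from the pretrained weights (an assumption).
    w1, w2 = train(X_ft, y_ft, w1, w2)
    # For isotropic Gaussian inputs, the excess test risk equals this squared distance.
    ft_err = np.sum((w1 * w2 - beta_ft) ** 2)
    results[name] = (pre_err, ft_err)
    print(f"{name}: pretraining max error = {pre_err:.4f}, fine-tuning error = {ft_err:.4f}")
```

Because gradient flow in this parameterization conserves `w1**2 - w2**2` per coordinate, the relative scale of the two layers at initialization persists through training, which is one way such a toy model can exhibit initialization-dependent fine-tuning behavior. The numbers this sketch prints depend on the assumed settings and should not be read as reproducing the paper's results.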