The paper introduces Chain-of-Models Pre-Training (CoM-PT), a method to accelerate vision foundation model (VFM) training by pre-training a sequence of models in ascending size order, using inverse knowledge transfer from smaller predecessors. This approach trains larger models by reusing knowledge in both parameter and feature spaces from smaller, already-trained models, achieving performance comparable to or better than individual training. Experiments across 45 datasets demonstrate significant training cost reduction, with acceleration ratios increasing as the VFM family scales, reaching up to 7.09X with 7 models.
Training more vision models can actually *increase* efficiency, thanks to a novel pre-training strategy that leverages knowledge transfer across a "chain" of models.
In this paper, we present Chain-of-Models Pre-Training (CoM-PT), a novel performance-lossless training acceleration method for vision foundation models (VFMs). This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM-PT is designed to accelerate the training pipeline at the model family level, scaling efficiently as the model family expands. Specifically, CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of model size, called the model chain. In this chain, only the smallest model undergoes standard individual pre-training, while the other models are efficiently trained through sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance that is mostly superior to standard individual training while significantly reducing training cost, a result that is extensively validated across 45 datasets spanning zero-shot and fine-tuning tasks. Notably, its efficient scaling property yields a remarkable phenomenon: training more models even results in higher efficiency. For instance, when pre-training on CC3M: i) given ViT-L as the largest model, progressively prepending smaller models to the model chain reduces computational complexity by up to 72%; ii) within a fixed model size range, as the VFM family scales across 3, 4, and 7 models, the acceleration ratio of CoM-PT exhibits a striking leap: from 4.13X to 5.68X and 7.09X. Since CoM-PT is naturally agnostic to specific pre-training paradigms, we open-source the code to spur further extensions in more computationally intensive scenarios, such as large language model pre-training.
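The chain described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how parameters or features are reused, so the block-copy initialization, the linear projection, and every name here (`expand_weights`, `feature_transfer_loss`, `build_chain`) are hypothetical stand-ins chosen only to make the two transfer paths concrete.

```python
import random


def expand_weights(small, rows, cols, scale=0.02, seed=0):
    """Build a rows x cols matrix whose top-left block is a copy of `small`.

    Assumption: block-copy plus small random fill is only one simple way to
    reuse parameters from a smaller model; the paper may use another scheme.
    """
    rng = random.Random(seed)
    big = [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]
    for i, row in enumerate(small):
        for j, value in enumerate(row):
            big[i][j] = value
    return big


def feature_transfer_loss(large_feats, small_feats, proj):
    """Mean squared error between projected large-model features and the
    smaller predecessor's features.

    `proj` is a hypothetical len(small_feats) x len(large_feats) matrix that
    maps the larger feature vector down to the predecessor's width.
    """
    projected = [sum(w * x for w, x in zip(row, large_feats)) for row in proj]
    squared = [(p - s) ** 2 for p, s in zip(projected, small_feats)]
    return sum(squared) / len(small_feats)


def build_chain(widths, first_weights):
    """Walk the chain in ascending size, seeding each model from its predecessor.

    `widths` must be sorted ascending; `first_weights` stands in for the
    smallest model, the only one pre-trained in the standard way.
    """
    chain = [first_weights]
    for width in widths[1:]:
        chain.append(expand_weights(chain[-1], width, width))
        # A real pipeline would train this model here, adding the feature
        # loss against its frozen predecessor to the usual objective.
    return chain
```

For example, `build_chain([2, 3, 5], [[1.0, 2.0], [3.0, 4.0]])` returns three square matrices of widths 2, 3, and 5, each carrying the smallest model's values in its top-left block. In this sketch the cost saving would come from each larger model starting closer to a good solution and therefore needing fewer training steps, which is an interpretation of the abstract rather than a stated mechanism.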