The paper analyzes the common practice of initializing new vocabulary tokens in language models with the mean of existing embeddings, showing that this collapses all new tokens into a degenerate subspace and hinders fine-tuning. To address this, the authors propose Grounded Token Initialization (GTI), which maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using paired linguistic supervision before fine-tuning. GTI outperforms mean initialization and other adaptation methods on generative recommendation benchmarks, demonstrating that initialization quality is a key bottleneck in vocabulary extension.
Mean-initializing new tokens in LMs creates a degenerate embedding space that cripples fine-tuning, but a simple "grounding" step can unlock significant performance gains in generative recommendation.
Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
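The degeneracy the abstract describes is easy to see numerically: mean-initializing all new tokens with one shared vector yields a rank-1 block with zero pairwise distances. The sketch below illustrates this with a toy NumPy embedding table; the grounding step at the end is only a hypothetical rendering of the GTI idea (initializing each new token from the embeddings of its paired linguistic description), not the paper's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pretrained embedding table: V existing tokens of dimension d,
# plus n_new domain-specific tokens to be added.
V, d, n_new = 1000, 64, 8
E = rng.normal(size=(V, d))

# Standard mean initialization: every new token gets the same vector.
mean_init = np.tile(E.mean(axis=0), (n_new, 1))

# Spectral/geometric diagnostic: the new-token block is rank 1 and all
# pairwise distances are zero, i.e. there is no inter-token structure
# for fine-tuning to build on.
rank_mean = np.linalg.matrix_rank(mean_init)
dist_mean = np.linalg.norm(mean_init[0] - mean_init[1])
print(rank_mean, dist_mean)  # 1 0.0

# Hypothetical grounding-style alternative (an assumption, not the
# paper's method): initialize each new token from the embeddings of the
# tokens in its paired linguistic description, which yields distinct,
# semantically placed vectors.
paired_desc_ids = [rng.choice(V, size=5, replace=False) for _ in range(n_new)]
grounded_init = np.stack([E[ids].mean(axis=0) for ids in paired_desc_ids])
print(np.linalg.matrix_rank(grounded_init))  # full rank for generic data
```

The rank and distance checks are the kind of cheap diagnostic one can run on any extended vocabulary before committing to fine-tuning.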