The paper analyzes the common practice of initializing new vocabulary tokens in language models with the mean of existing embeddings, showing that this collapses all new tokens into a degenerate subspace and hinders fine-tuning. To address this, the authors propose Grounded Token Initialization (GTI), which maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using paired linguistic supervision before fine-tuning. GTI outperforms mean initialization and other adaptation methods on generative recommendation benchmarks, demonstrating that initialization quality is a key bottleneck in vocabulary extension.
Mean-initializing new tokens in LMs creates a degenerate embedding space that cripples fine-tuning, but a simple "grounding" step can unlock significant performance gains in generative recommendation.
Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
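The degeneracy the abstract describes is easy to see numerically: mean-initializing all new tokens with one shared vector yields a rank-1 block with zero pairwise distances. The sketch below illustrates this with a toy NumPy embedding table; the grounding step at the end is only a hypothetical rendering of the GTI idea (initializing each new token from the embeddings of its paired linguistic description), not the paper's actual algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pretrained embedding table: V existing tokens of dimension d,
# plus n_new domain-specific tokens to be added.
V, d, n_new = 1000, 64, 8
E = rng.normal(size=(V, d))

# Standard mean initialization: every new token gets the same vector.
mean_init = np.tile(E.mean(axis=0), (n_new, 1))

# Spectral/geometric diagnostic: the new-token block is rank 1 and all
# pairwise distances are zero, i.e. there is no inter-token structure
# for fine-tuning to build on.
rank_mean = np.linalg.matrix_rank(mean_init)
dist_mean = np.linalg.norm(mean_init[0] - mean_init[1])
print(rank_mean, dist_mean)  # 1 0.0

# Hypothetical grounding-style alternative (an assumption, not the
# paper's method): initialize each new token from the embeddings of the
# tokens in its paired linguistic description, which yields distinct,
# semantically placed vectors.
paired_desc_ids = [rng.choice(V, size=5, replace=False) for _ in range(n_new)]
grounded_init = np.stack([E[ids].mean(axis=0) for ids in paired_desc_ids])
print(np.linalg.matrix_rank(grounded_init))  # full rank for generic data
```

The rank and distance checks are the kind of cheap diagnostic one can run on any extended vocabulary before committing to fine-tuning.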