CAS · Apr 23, 2026 · arXiv:2604.21724

Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling

Yilong Chen, Yan Xie, Zitian Gao, He Xin, Yihao Xiao, Ren-Shuo Liu, Haoming Luo, Yifan Luo, Zhengmao Ye, Tingwen Liu, Xin Zhao, Ran Tao, Bryan Dai

AI Summary

The paper introduces X-GRAM, a frequency-aware dynamic token-injection framework designed to improve the parameter efficiency of token-indexed lookup tables in large language models. X-GRAM uses hybrid hashing and alias mixing to compress the long tail of infrequent tokens and refines retrieved vectors via normalized SwiGLU ShortConv to extract diverse local n-gram features. Experiments at 0.73B and 1.15B scales demonstrate that X-GRAM improves average accuracy by up to 4.4 points compared to vanilla backbones and 3.2 points over strong retrieval baselines, while using smaller tables.
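The frequency-aware split between head and tail described above can be illustrated with a small sketch. This is not the paper's implementation: the function names (`build_head_index`, `lookup_slots`, `mix`), the choice of BLAKE2b as the hash, and the reading of "alias mixing" as averaging the rows of several hashed slots are all assumptions made for illustration. The idea shown is only that frequent n-grams keep dedicated rows while rare ones share a smaller hashed region.

```python
import hashlib
from collections import Counter


def build_head_index(ngrams, head_size):
    """Assign a dedicated slot to each of the `head_size` most frequent n-grams."""
    counts = Counter(ngrams)
    return {ng: slot for slot, (ng, _) in enumerate(counts.most_common(head_size))}


def _stable_hash(ngram, seed):
    """Deterministic 64-bit hash of an n-gram (tuple of token ids) under a seed."""
    data = (str(seed) + ":" + ",".join(map(str, ngram))).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def lookup_slots(ngram, head_index, head_size, tail_size, num_aliases=2):
    """Return table rows for an n-gram.

    Head n-grams map to one dedicated row. Tail n-grams map to `num_aliases`
    rows in a shared region that starts at `head_size` (an assumed layout).
    """
    if ngram in head_index:
        return [head_index[ngram]]
    return [head_size + _stable_hash(ngram, seed) % tail_size
            for seed in range(num_aliases)]


def mix(table, slots):
    """Average the selected rows (an assumed stand-in for alias mixing)."""
    dim = len(table[0])
    return [sum(table[s][d] for s in slots) / len(slots) for d in range(dim)]
```

With this layout the table needs `head_size + tail_size` rows regardless of how many distinct tail n-grams occur, which is the sense in which the tail is compressed.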

Key Contribution

By dynamically injecting frequency-aware n-gram features, X-GRAM achieves state-of-the-art accuracy with smaller embedding tables, offering a practical path to scaling memory-augmented architectures.

Abstract

Large token-indexed lookup tables provide a compute-decoupled scaling path, but their practical gains are often limited by poor parameter efficiency and rapid memory growth. We attribute these limitations to Zipfian under-training of the long tail, heterogeneous demand across layers, and "slot collapse" that produces redundant embeddings. To address this, we propose X-GRAM, a frequency-aware dynamic token-injection framework. X-GRAM employs hybrid hashing and alias mixing to compress the tail while preserving head capacity, and refines retrieved vectors via normalized SwiGLU ShortConv to extract diverse local n-gram features. These signals are integrated into attention value streams and inter-layer residuals using depth-aware gating, effectively aligning static memory with dynamic context. This design introduces a memory-centric scaling axis that decouples model capacity from FLOPs. Extensive evaluations at the 0.73B and 1.15B scales show that X-GRAM improves average accuracy by as much as 4.4 points over the vanilla backbone and 3.2 points over strong retrieval baselines, while using substantially smaller tables in the 50% configuration. Overall, by decoupling capacity from compute through efficient memory management, X-GRAM offers a scalable and practical paradigm for future memory-augmented architectures. Code is available at https://github.com/Longyichen/X-gram.
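The abstract's depth-aware gating can be pictured as a per-layer gate that scales the retrieved memory vector before it is added to a residual stream. The sketch below is only an illustration under stated assumptions: `depth_gate`, its `bias` and `slope` parameters, and the sigmoid-of-relative-depth form are hypothetical. The paper's gate is presumably learned and may also depend on the input, which this sketch does not model.

```python
import math


def depth_gate(layer, num_layers, bias=0.0, slope=1.0):
    """Hypothetical scalar gate in (0, 1) that depends only on relative depth."""
    depth = layer / max(num_layers - 1, 1)  # 0.0 at the first layer, 1.0 at the last
    return 1.0 / (1.0 + math.exp(-(bias + slope * depth)))


def inject(hidden, memory, gate):
    """Add the gated memory vector to the hidden state, element by element."""
    return [h + gate * m for h, m in zip(hidden, memory)]
```

For example, `inject(hidden, memory, depth_gate(layer, num_layers))` would apply a different injection strength at each layer; with a positive `slope`, deeper layers receive more of the memory signal, and a negative `slope` reverses that.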

Architecture Design (Transformers, SSMs, MoE) · Inference & Quantization · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 25

Year: 2026

Venue: N/A
