CUHK, IDEA · Jun 10, 2026 · arXiv:2606.12113

Augmenting Molecular Language Models with Local $n$-gram Memory

Xinni Zhang, Zijing Liu, He Cao, Yu Li, Irwin King

AI Summary

This paper introduces MolGram, which integrates a conditional $n$-gram memory module into transformer-based molecular language models to address the locality gap caused by character-level tokenization of SMILES strings. MolGram maps local string patterns to learned embeddings via hash lookups and injects this context into hidden states, relieving the model of repeatedly learning local syntax at the expense of long-range dependencies. Across unconditional molecule generation, forward reaction prediction, and single-step retrosynthesis, MolGram consistently improves performance and outperforms baselines with 3$\times$ more parameters, indicating that explicit local pattern memory is an efficient inductive bias.

Key Contribution

Explicit local pattern memory improves molecular language models efficiently: MolGram outperforms baselines with 3$\times$ more parameters.

Abstract

Transformer-based language models for SMILES strings suffer from a locality gap: standard character-level tokenization fragments chemically meaningful motifs, forcing models to repeatedly learn local syntax at the expense of long-range dependencies. To address this without disrupting standard tokenizers, we propose MolGram, which integrates a conditional $n$-gram memory module into molecular language models. MolGram maps local string patterns to learned embeddings via scalable hash lookups and dynamically injects this regional context into hidden states. Evaluations across three tasks (unconditional molecule generation, forward reaction prediction, and single-step retrosynthesis) show that MolGram consistently improves performance. Crucially, our analyses demonstrate that MolGram outperforms baselines with 3$\times$ more parameters, establishing explicit local pattern memory as a highly efficient inductive bias.
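
The mechanism described in the abstract (hash a local string pattern, look up a learned embedding, inject it into the hidden state) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `char_tokenize`, `ngram_bucket`, and `NGramMemory`, the rolling hash, the window size `n=3`, the table size, and the sigmoid gate are all assumptions chosen for illustration, and the "learned" table is simply randomly initialized here.

```python
import math
import random


def char_tokenize(smiles):
    """Character-level tokenization: every character becomes its own token."""
    return list(smiles)


def ngram_bucket(ngram, num_buckets):
    """Map an n-gram to a bucket index with a simple polynomial rolling hash.

    The hash function is an assumption; any deterministic hash would do.
    """
    h = 0
    for ch in "".join(ngram):
        h = (h * 131 + ord(ch)) % num_buckets
    return h


class NGramMemory:
    """Hypothetical hashed n-gram memory (illustrative only)."""

    def __init__(self, n=3, num_buckets=1024, dim=8, seed=0):
        rng = random.Random(seed)
        self.n = n
        self.num_buckets = num_buckets
        self.dim = dim
        # Embedding table; in a real model these rows would be trained.
        self.table = [
            [rng.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(num_buckets)
        ]

    def lookup(self, tokens, pos):
        """Return the embedding of the n-gram ending at position `pos`."""
        start = max(0, pos - self.n + 1)  # shorter window near the start
        ngram = tokens[start:pos + 1]
        return self.table[ngram_bucket(ngram, self.num_buckets)]

    def inject(self, tokens, hidden):
        """Add a gated memory vector to each position's hidden state.

        The gate (sigmoid of the dot product between hidden state and
        memory vector) is an assumed form of "conditional" injection.
        """
        out = []
        for pos, h in enumerate(hidden):
            m = self.lookup(tokens, pos)
            score = sum(a * b for a, b in zip(h, m))
            gate = 1.0 / (1.0 + math.exp(-score))
            out.append([a + gate * b for a, b in zip(h, m)])
        return out


# Acetyl chloride: the two-character atom "Cl" is split into 'C' and 'l',
# which is the kind of fragmentation the abstract calls the locality gap.
tokens = char_tokenize("CC(=O)Cl")
# tokens == ['C', 'C', '(', '=', 'O', ')', 'C', 'l']

hidden = [[0.0] * 8 for _ in tokens]  # stand-in for transformer hidden states
memory = NGramMemory()
updated = memory.inject(tokens, hidden)
```

Because the lookup depends only on the hashed window, the same local pattern retrieves the same vector wherever it occurs in the string, and the tokenizer itself is left unchanged. Hash collisions between distinct $n$-grams are possible in a sketch like this; how MolGram handles them is not stated in the abstract.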

Natural Language Processing · Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
