This paper introduces MolGram, which integrates a conditional $n$-gram memory module into transformer-based molecular language models to address the locality gap caused by standard character-level tokenization of SMILES strings. By mapping local string patterns to learned embeddings and dynamically injecting this context into hidden states, MolGram enhances the model's ability to capture both local syntax and long-range dependencies. Evaluations across molecule generation, reaction prediction, and retrosynthesis show that MolGram consistently improves performance and outperforms baselines with three times as many parameters, highlighting the efficiency of explicit local pattern memory.
Explicit local pattern memory improves molecular language models, allowing them to outperform baselines with three times as many parameters.
Transformer-based language models for SMILES strings suffer from a locality gap: standard character-level tokenization fragments chemically meaningful motifs, forcing models to repeatedly learn local syntax at the expense of long-range dependencies. To address this without disrupting standard tokenizers, we propose MolGram, which integrates a conditional $n$-gram memory module into molecular language models. MolGram maps local string patterns to learned embeddings via scalable hash lookups and dynamically injects this regional context into hidden states. Evaluations across three tasks (unconditional molecule generation, forward reaction prediction, and single-step retrosynthesis) show that MolGram consistently improves performance. Crucially, our analyses demonstrate that MolGram outperforms baselines with 3$\times$ as many parameters, establishing explicit local pattern memory as a highly efficient inductive bias.
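The abstract describes the mechanism only at a high level, so the following is a minimal illustrative sketch, not the authors' implementation. It shows one way hashed $n$-gram lookups and injection into hidden states could fit together. All names (`ngram_bucket_ids`, `make_table`, `inject`), the CRC32 hash, the table size, and the scalar sigmoid gate standing in for the "conditional" part are assumptions made for illustration; in a real model the table and gate would be learned, not random.

```python
import math
import random
import zlib


def ngram_bucket_ids(tokens, n, num_buckets):
    """Hash the n-gram ending at each position into a bucket id.

    Positions near the start use a shorter prefix (assumption).
    """
    ids = []
    for i in range(len(tokens)):
        start = max(0, i - n + 1)
        # Join with a separator so multi-character tokens stay unambiguous.
        gram = "\x1f".join(tokens[start:i + 1])
        ids.append(zlib.crc32(gram.encode("utf-8")) % num_buckets)
    return ids


def make_table(num_buckets, dim, seed=0):
    """Stand-in for a learned embedding table (random here, hypothetical)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)]
            for _ in range(num_buckets)]


def inject(hidden, tokens, table, gate_w, n=3):
    """Add a gated n-gram memory vector to each position's hidden state."""
    ids = ngram_bucket_ids(tokens, n, len(table))
    out = []
    for h, idx in zip(hidden, ids):
        memory = table[idx]
        # Assumed gate: sigmoid of a dot product with the hidden state,
        # so the amount of injected context depends on the current state.
        score = sum(a * b for a, b in zip(h, gate_w))
        gate = 1.0 / (1.0 + math.exp(-score))
        out.append([a + gate * b for a, b in zip(h, memory)])
    return out


# Usage: character-level tokens of acetic acid, toy 4-dimensional states.
tokens = list("CC(=O)O")
hidden = [[0.0] * 4 for _ in tokens]
table = make_table(num_buckets=64, dim=4)
gate_w = [0.0] * 4
ids = ngram_bucket_ids(tokens, 3, len(table))
out = inject(hidden, tokens, table, gate_w, n=3)
```

Because the lookup is a hash into a fixed-size table, memory cost in this sketch is set by the number of buckets, not by the number of distinct $n$-grams. Distinct patterns can collide in the same bucket, and how such collisions are handled is left open here.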