This paper introduces a dual-task framework that injects structural knowledge into protein language models (pLMs) by aligning residue representations with those of protein graph neural networks (pGNNs) and by predicting structure tokens. A residue loss selection module focuses training on reliable structural information. Post-training ESM2 and AMPLIFY with this method yields significant improvements in deep mutational scanning fitness prediction and contact prediction, with gains that hold across model sizes.
Dramatically improve protein language models by simply post-training them to align with protein graphs, yielding a 59% increase in contact prediction precision (P@L).
Protein language models (pLMs) pre-trained on vast protein sequence databases excel at various downstream tasks but often lack the structural knowledge essential for some biological applications. To address this, we introduce a method to enrich pLMs with structural knowledge by leveraging pre-trained protein graph neural networks (pGNNs). First, a latent-level contrastive learning task aligns residue representations from pLMs with those from pGNNs across multiple proteins, injecting inter-protein structural information. Additionally, a physical-level task integrates intra-protein information by training pLMs to predict structure tokens. Together, the proposed dual-task framework effectively incorporates both inter- and intra-protein structural knowledge into pLMs. Given the variability in the quality of protein structures in the PDB, we further introduce a residue loss selection module that uses a small model trained on high-quality structures to select reliable yet challenging residue losses for the pLM to learn from. Applying our structure alignment method as a simple, lightweight post-training step to the state-of-the-art pLMs ESM2 and AMPLIFY yields notable performance gains. These improvements are consistent across a wide range of tasks, including substantial gains in deep mutational scanning (DMS) fitness prediction and a 59% increase in P@L for ESM2 650M contact prediction on CASP16. Furthermore, we demonstrate that these performance gains are robust, scaling with model sizes from 8M to 650M and extending to different downstream tasks.
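As a rough illustration of the latent-level task, the sketch below implements an InfoNCE-style contrastive loss that pulls each pLM residue embedding toward the pGNN embedding of the same residue while contrasting it against residues from other proteins in the batch. The function name, tensor shapes, and temperature are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def residue_alignment_loss(plm_repr: torch.Tensor,
                           pgnn_repr: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Contrastive (InfoNCE-style) alignment of residue representations.

    Both inputs are (num_residues, dim), with residues from several proteins
    concatenated along the first axis so that negatives span multiple proteins.
    Row i of plm_repr and row i of pgnn_repr describe the same residue.
    """
    plm = F.normalize(plm_repr, dim=-1)
    pgnn = F.normalize(pgnn_repr, dim=-1)
    # Similarity of every pLM residue against every pGNN residue.
    logits = plm @ pgnn.t() / temperature
    # Matched residue pairs sit on the diagonal.
    targets = torch.arange(plm.size(0), device=plm.device)
    # Symmetric loss: pLM -> pGNN and pGNN -> pLM directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```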
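The physical-level task can be pictured as an ordinary token-prediction objective: a linear head over the pLM's per-residue hidden states scored against discrete structure tokens with cross-entropy. The head class, vocabulary size, and padding convention below are assumptions for illustration; the paper's structure tokenizer and head architecture may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureTokenHead(nn.Module):
    """Predicts a discrete structure token for every residue from pLM hidden states."""

    def __init__(self, hidden_dim: int, num_structure_tokens: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_structure_tokens)

    def forward(self, residue_states: torch.Tensor) -> torch.Tensor:
        # residue_states: (batch, seq_len, hidden_dim) -> (batch, seq_len, num_tokens)
        return self.proj(residue_states)

def structure_token_loss(logits: torch.Tensor,
                         target_tokens: torch.Tensor,
                         pad_id: int = -100) -> torch.Tensor:
    """Per-residue cross-entropy against structure tokens; padded positions are ignored."""
    # cross_entropy expects (batch, num_classes, seq_len).
    return F.cross_entropy(logits.transpose(1, 2), target_tokens, ignore_index=pad_id)
```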
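One plausible reading of the residue loss selection module is sketched below: a small model trained on high-quality structures supplies a per-residue reference loss, and only residues that this reference model finds reliable contribute to the pLM's training loss, so the pLM learns from trustworthy yet still-challenging positions. The quantile threshold and masking rule are assumptions, not the paper's exact selection criterion.

```python
import torch

def select_residue_losses(plm_losses: torch.Tensor,
                          ref_losses: torch.Tensor,
                          reliable_quantile: float = 0.5) -> torch.Tensor:
    """Aggregate pLM per-residue losses, keeping only residues the small
    reference model (trained on high-quality structures) deems reliable.

    plm_losses, ref_losses: (num_residues,) per-residue losses for the same residues.
    """
    # Residues whose reference loss is low are treated as reliable supervision.
    threshold = torch.quantile(ref_losses, reliable_quantile)
    mask = ref_losses <= threshold
    if mask.sum() == 0:
        # Fallback: no residue selected, use the unfiltered mean.
        return plm_losses.mean()
    return plm_losses[mask].mean()
```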