Mar 9, 2026 | arXiv:2603.08881

From Word2Vec to Transformers: Text-Derived Composition Embeddings for Filtering Combinatorial Electrocatalysts

AI Summary

This paper evaluates a label-free screening strategy for combinatorial electrocatalysts in which each composition is represented by embeddings derived from scientific texts. It compares a corpus-trained Word2Vec baseline with transformer-based embeddings, encoding compositions either by linear element-wise mixing or by short composition prompts. Candidates are filtered, without electrochemical labels, by their similarity to two property concepts (the terms conductivity and dielectric), and the approach is assessed on 15 materials libraries.

Key Contribution

A lightweight Word2Vec baseline, which uses a simple linear combination of element embeddings, often achieves the largest reduction of the candidate composition space among the compared embeddings, including the transformer-based ones, while staying close to the best measured performance.

Abstract

Compositionally complex solid solution electrocatalysts span vast composition spaces, and even one materials system can contain more candidate compositions than can be measured exhaustively. Here we evaluate a label-free screening strategy that represents each composition using embeddings derived from scientific texts and prioritizes candidates based on similarity to two property concepts. We compare a corpus-trained Word2Vec baseline with transformer-based embeddings, where compositions are encoded either by linear element-wise mixing or by short composition prompts. Similarities to two 'concept directions', the terms conductivity and dielectric, define a 2-dimensional descriptor space, and a symmetric Pareto-front selection is used to filter candidate subsets without using electrochemical labels. Performance is assessed on 15 materials libraries including noble metal alloys and multicomponent oxides. In this setting, the lightweight Word2Vec baseline, which uses a simple linear combination of element embeddings, often achieves the largest reduction in the number of candidate compositions while staying close to the best measured performance.
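The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that element-wise mixing means a weighted sum of element vectors with atomic fractions as weights, that similarity means cosine similarity, and that selection keeps the ordinary non-dominated set when both similarities are maximized. The abstract does not specify how the symmetric Pareto-front variant is constructed, so that part is not reproduced here. The function names, the toy element set, and the random vectors standing in for trained embeddings are all hypothetical.

```python
import numpy as np


def mix_embedding(composition, element_vectors):
    """Weighted sum of element vectors, weights = normalized atomic fractions (assumed)."""
    total = sum(composition.values())
    return sum((frac / total) * element_vectors[el] for el, frac in composition.items())


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def pareto_front(points):
    """Indices of non-dominated rows when every column is maximized."""
    points = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(points):
        # A row dominates p if it is >= p everywhere and > p somewhere.
        no_worse = np.all(points >= p, axis=1)
        better = np.any(points > p, axis=1)
        if not np.any(no_worse & better):
            keep.append(i)
    return keep


# Toy stand-ins for trained embeddings (hypothetical, random, 8-dimensional).
rng = np.random.default_rng(0)
element_vectors = {el: rng.normal(size=8) for el in ("Ag", "Pd", "Pt")}
concepts = {name: rng.normal(size=8) for name in ("conductivity", "dielectric")}

# A small hypothetical library: atomic fractions on a coarse grid.
library = [
    {"Ag": a, "Pd": b, "Pt": 10 - a - b}
    for a in range(0, 11, 2)
    for b in range(0, 11 - a, 2)
]

# 2-D descriptor space: similarity to each concept term.
descriptors = np.array([
    [cosine(mix_embedding(c, element_vectors), concepts["conductivity"]),
     cosine(mix_embedding(c, element_vectors), concepts["dielectric"])]
    for c in library
])

selected = pareto_front(descriptors)
print(f"kept {len(selected)} of {len(library)} candidate compositions")
```

With real embeddings, `element_vectors` and `concepts` would come from the trained text model instead of a random generator. The candidate reduction reported in the abstract corresponds to the ratio between the selected subset and the full library.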

Architecture Design (Transformers, SSMs, MoE) | Natural Language Processing | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 29

Year: 2026

Venue: N/A
