This paper introduces a physics-augmented machine learning framework for predicting molecular properties, specifically normal boiling points, by using thermodynamic descriptors derived from molecular dynamics simulations as features. A CatBoost regression model is trained on ensemble-averaged cohesive energies, heats of vaporization, and densities obtained from liquid-phase simulations. The key result is that this physics-based approach demonstrates superior extrapolation capabilities compared to conventional structure-based models, particularly for chemical classes absent from the training data, including inorganic compounds and molecules with uncommon elements.
Forget hand-engineered structural features: this method uses molecular dynamics-derived thermodynamic properties to train ML models that extrapolate to completely novel chemical spaces, like inorganics, where traditional models fail.
Machine learning (ML) models that rely on molecular structure excel at predicting properties for well-represented organic compounds; however, their limited ability to extrapolate to chemotypes outside their training domain remains a critical bottleneck in chemical discovery. This challenge is particularly acute in industrial discovery, where navigating uncharted chemical space to generate new intellectual property is a primary objective. Normal boiling points serve as a key benchmark for testing the extrapolative power of ML algorithms. A major limitation is that group-contribution methods are, by design, unable to generate predictions for molecules containing unparameterized fragments. Here, we demonstrate that this limitation can be overcome by replacing structural descriptors with thermodynamic properties computed directly from molecular dynamics simulations. We introduce a physics-augmented framework in which a CatBoost regression model learns directly from ensemble-averaged cohesive energies, heats of vaporization, and densities extracted from atomistic liquid-phase simulations. Benchmark comparisons reveal that while our physics-augmented model and conventional structure-based models perform comparably well on standard organic compounds, only the former maintains controlled error growth when extrapolating to structurally dissimilar chemical space. Our model successfully predicts boiling points for chemical classes entirely absent from training -- including inorganic compounds, salts, and molecules with elements like Si, B, and Te -- where structure-based models are fundamentally inapplicable. By encoding the intermolecular forces governing phase behavior, our framework establishes a generalizable strategy for property prediction beyond the structural boundaries of existing methods.
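As a rough illustration of how the three descriptors named in the abstract could be assembled from simulation output, the sketch below reduces per-frame energies and box volumes to ensemble averages. It is not the authors' pipeline. The function name `md_descriptors`, its arguments, and the units are hypothetical, and the heat of vaporization uses the common textbook approximation (gas-phase energy minus per-molecule liquid energy, plus RT, assuming an ideal vapor and negligible liquid molar volume), which the paper may or may not follow.

```python
from statistics import fmean

R_KJ_PER_MOL_K = 8.314462618e-3  # molar gas constant in kJ/(mol K)
AVOGADRO = 6.02214076e23         # molecules per mole
NM3_TO_CM3 = 1e-21               # 1 nm^3 expressed in cm^3


def md_descriptors(liquid_energies, gas_energies, box_volumes_nm3,
                   n_molecules, molar_mass_g_mol, temperature_k):
    """Reduce per-frame MD output to three ensemble-averaged descriptors.

    Hypothetical inputs (assumptions, not taken from the paper):
      liquid_energies  -- total potential energy of the liquid box per frame, kJ/mol
      gas_energies     -- potential energy of one isolated molecule per frame, kJ/mol
      box_volumes_nm3  -- liquid box volume per frame, nm^3
    """
    # Ensemble averages: liquid energy is normalized per molecule.
    e_liquid = fmean(liquid_energies) / n_molecules
    e_gas = fmean(gas_energies)

    # Cohesive energy: energy needed to pull one molecule out of the liquid.
    cohesive_energy = e_gas - e_liquid

    # Textbook approximation: dH_vap ~ E_gas - E_liq/N + RT.
    heat_of_vaporization = cohesive_energy + R_KJ_PER_MOL_K * temperature_k

    # Density from total mass in the box and the mean box volume.
    mass_g = n_molecules * molar_mass_g_mol / AVOGADRO
    density = mass_g / (fmean(box_volumes_nm3) * NM3_TO_CM3)

    return {
        "cohesive_energy_kj_mol": cohesive_energy,
        "heat_of_vaporization_kj_mol": heat_of_vaporization,
        "density_g_cm3": density,
    }


# Illustrative, made-up numbers for a water-like box of 1000 molecules.
features = md_descriptors(
    liquid_energies=[-41000.0, -42000.0],
    gas_energies=[-0.5, 0.5],
    box_volumes_nm3=[30.0, 30.0],
    n_molecules=1000,
    molar_mass_g_mol=18.015,
    temperature_k=298.15,
)
```

Under these assumptions, each molecule yields one such feature row, and a gradient-boosted regressor such as CatBoost would be fit on those rows against measured boiling points. Because the features carry no atom or fragment identities, the same function applies unchanged to any compound that can be simulated, which is the property the abstract credits for extrapolation.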