Polish Academy of Sciences | UGent | Feb 18, 2026 | arXiv:2602.16507

Small molecule retrieval from tandem mass spectrometry: what are we optimizing for?

Gaetan De Waele, Marek Wydmuch, Krzysztof Dembczyński, Wojciech Kotłowski, Willem Waegeman

AI Summary

This paper investigates the impact of different loss functions on deep learning models for small molecule retrieval from tandem mass spectrometry data. It focuses on the trade-off between optimizing for accurate molecular fingerprint prediction versus optimizing for accurate molecular retrieval from a database. The authors demonstrate theoretically and empirically that improving fingerprint prediction accuracy often degrades retrieval performance, and vice versa, with the severity of the trade-off depending on the similarity structure of the candidate sets.

Key Contribution

Optimizing deep learning models for accurate molecular fingerprint prediction can actually *worsen* your ability to retrieve the correct molecule from mass spectrometry data.

Abstract

One of the central challenges in the computational analysis of liquid chromatography-tandem mass spectrometry (LC-MS/MS) data is to identify the compounds underlying the output spectra. In recent years, this problem has increasingly been tackled using deep learning methods. A common strategy involves predicting a molecular fingerprint vector from an input mass spectrum, which is then used to search for matches in a chemical compound database. While various loss functions are employed in training these predictive models, their impact on model performance remains poorly understood. In this study, we investigate commonly used loss functions, deriving novel regret bounds that characterize when Bayes-optimal decisions for these objectives must diverge. Our results reveal a fundamental trade-off between the two objectives of (1) fingerprint similarity and (2) molecular retrieval. Optimizing for more accurate fingerprint predictions typically worsens retrieval results, and vice versa. Our theoretical analysis shows this trade-off depends on the similarity structure of candidate sets, providing guidance for loss function and fingerprint selection.
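To make the claimed divergence concrete, here is a minimal toy sketch, not the authors' method or data. It assumes a three-molecule candidate set with invented 6-bit fingerprints, an invented posterior over candidates given a spectrum, expected Hamming loss as the fingerprint objective, and Tanimoto similarity for the database search. None of these specifics come from the paper. Under these assumptions, the fingerprint that minimizes expected Hamming loss (per-bit marginals thresholded at 0.5) retrieves a different candidate than the one that maximizes top-1 retrieval accuracy.

```python
# Toy illustration only: candidates, fingerprints, and posterior probabilities
# are invented for this sketch and are not taken from the paper.

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 1.0

# Hypothetical candidate set: B and C are near-duplicates, A is dissimilar.
candidates = {
    "A": [1, 1, 1, 0, 0, 0],
    "B": [0, 0, 0, 1, 1, 0],
    "C": [0, 0, 0, 1, 1, 1],
}
# Hypothetical posterior P(candidate | spectrum).
posterior = {"A": 0.4, "B": 0.3, "C": 0.3}

# Objective (2), retrieval: the Bayes-optimal top-1 decision is simply
# the most probable candidate.
retrieval_choice = max(posterior, key=posterior.get)

# Objective (1), fingerprint accuracy under expected Hamming loss:
# the Bayes-optimal prediction thresholds each bit's marginal at 0.5.
n_bits = len(candidates["A"])
marginals = [
    sum(posterior[name] * fp[i] for name, fp in candidates.items())
    for i in range(n_bits)
]
predicted_fp = [int(m > 0.5) for m in marginals]

# Database search: rank candidates by similarity to the predicted fingerprint.
fingerprint_choice = max(
    candidates, key=lambda name: tanimoto(predicted_fp, candidates[name])
)

print("retrieval-optimal choice:", retrieval_choice)
print("predicted fingerprint:", predicted_fp)
print("choice via fingerprint search:", fingerprint_choice)
```

In this toy, the similar pair B and C pool their probability mass on shared bits, so the Hamming-optimal fingerprint matches B (posterior 0.3) even though A (posterior 0.4) is the better retrieval answer. If the three fingerprints were mutually dissimilar, or if one candidate held more than half the probability mass, the two decisions would coincide, which is one way to read the abstract's statement that the trade-off depends on the similarity structure of the candidate set.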

Recommendation & Information Retrieval | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
