Acceleration Consortium | KAUST | Princeton | UofT | Vector | May 6, 2026 | arXiv:2605.05104

Building informative materials datasets beyond targeted objectives

Rafael Espinosa Castañeda, Ashley Dale, Hongchen Wang, Yonatan Kurniawan, Hao Wan, Runze Zhang, Adji Bousso Dieng, Kangming Li, Jason Hattrick-Simpers

AI Summary

This paper introduces a diversity-aware framework for constructing materials science datasets that remain informative for both targeted and untargeted properties. The framework uses diversity-aware selection to ensure broad coverage of the materials space, addressing the problem that datasets built around a prioritized subset of properties can be poorly suited to future learning tasks. In noisy experimental dataset construction, selection without diversity degrades prediction performance relative to random sampling by up to 40% on untargeted properties and up to 12.5% on targeted properties, whereas the framework yields improvements of up to 10% on untargeted properties and up to 25% on targeted properties.

Key Contribution

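Don't let your materials science dataset become obsolete: a diversity-aware construction framework can improve performance over random sampling by up to 25% on targeted properties and up to 10% on *untargeted* ones, whereas selection without diversity can degrade untargeted performance by up to 40%.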

Abstract

Materials science data collection can be expensive, making the reuse and long-term utility of datasets critically important for future discovery campaigns. In practice, researchers prioritize a subset of properties according to their research interests. However, ignoring a subset of outcomes in data collection campaigns can generate datasets that are poorly suited for future learning tasks. Here, we present a framework for dataset construction that maximizes informativeness for target properties of interest while preserving performance on untargeted ones. Our approach uses diversity-aware selection to ensure broad coverage of the materials space. In noisy experimental dataset construction, we find that without our diversity-aware framework, prediction performance on untargeted properties can degrade by up to 40% relative to random sampling, whereas applying our framework yields improvements of up to 10%. For targeted properties, performance can degrade with respect to random sampling by up to 12.5% without diversity, while our framework achieves gains of up to 25%. Incorporating diversity into dataset construction not only preserves informativeness for the targeted properties, but also improves materials coverage for potential future objectives. As a result, the constructed datasets remain broadly informative across considered and unconsidered outcomes, ensuring unbiased, high-quality entries and mitigating cold-start limitations in subsequent modeling and discovery campaigns.
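
The abstract does not spell out the selection rule, so the following is only an illustrative sketch of what diversity-aware selection can look like, not the authors' method. It assumes a greedy max-min (farthest-point) heuristic: each candidate material has a feature vector and a hypothetical informativeness score for the targeted property, and each pick balances that score against distance from the entries already selected. The function name `greedy_diverse_selection`, the `trade_off` parameter, and the toy data are all assumptions made for illustration.

```python
import math


def greedy_diverse_selection(features, scores, k, trade_off=0.5):
    """Pick k candidate indices, balancing target score against diversity.

    Illustrative heuristic only (an assumption, not the paper's algorithm).
    features:  list of equal-length numeric vectors, one per candidate.
    scores:    hypothetical informativeness of each candidate for the target.
    trade_off: 0.0 uses the score only, 1.0 uses diversity only.
    """
    n = len(features)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of candidates")

    # Rescale scores to [0, 1] so they are comparable with scaled distances.
    lo, hi = min(scores), max(scores)
    norm = [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

    # Seed with the highest-scoring candidate.
    first = max(range(n), key=lambda i: norm[i])
    selected = [first]

    # Distance from every candidate to its nearest selected entry.
    min_dist = [math.dist(features[i], features[first]) for i in range(n)]

    while len(selected) < k:
        scale = max(min_dist)
        best, best_utility = None, -math.inf
        for i in range(n):
            if i in selected:
                continue
            diversity = min_dist[i] / scale if scale > 0 else 0.0
            utility = (1 - trade_off) * norm[i] + trade_off * diversity
            if utility > best_utility:
                best, best_utility = i, utility
        selected.append(best)
        # Update nearest-selected distances with the new pick.
        min_dist = [
            min(d, math.dist(features[i], features[best]))
            for i, d in enumerate(min_dist)
        ]
    return selected


# Toy data: two clusters; the high-scoring candidates all sit in the first one.
toy_features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [10.0, 10.0], [10.1, 10.0]]
toy_scores = [1.0, 0.9, 0.8, 0.1, 0.0]

score_only = greedy_diverse_selection(toy_features, toy_scores, 2, trade_off=0.0)
diversity_only = greedy_diverse_selection(toy_features, toy_scores, 2, trade_off=1.0)
```

On this toy data, the score-only setting keeps both picks inside the first cluster, while the diversity-weighted setting reaches into the second cluster. That is the kind of broader coverage of the materials space that the abstract credits for preserving performance on untargeted properties.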

Data Curation & Synthetic Data | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
