This paper introduces a diversity-aware framework for materials science dataset construction that maximizes informativeness for targeted properties while preserving performance on untargeted ones. The framework uses diversity-aware selection to ensure broad coverage of the materials space, addressing the problem that datasets built around a prioritized subset of properties are often poorly suited to future learning tasks. In noisy experimental dataset construction, selection without diversity degrades prediction performance relative to random sampling by up to 40% on untargeted properties and up to 12.5% on targeted ones, whereas the framework yields improvements of up to 10% on untargeted properties and up to 25% on targeted properties.
Don't let your materials science dataset become obsolete: a diversity-aware construction framework improves on random sampling by up to 25% on targeted properties and up to 10% on *untargeted* ones, where selection without diversity can degrade performance by up to 40%.
Materials science data collection can be expensive, making the reuse and long-term utility of datasets critically important for future discovery campaigns. In practice, researchers prioritize a subset of properties due to research interests. However, ignoring a subset of outcomes in data collection campaigns can produce datasets that are poorly suited for future learning tasks. Here, we present a framework for dataset construction that maximizes informativeness for target properties of interest while preserving performance on untargeted ones. Our approach uses diversity-aware selection to ensure broad coverage of the materials space. In noisy experimental dataset construction, we find that without our diversity-aware framework, prediction performance on untargeted properties can degrade by up to 40% relative to random sampling, whereas applying our framework yields improvements of up to 10%. For targeted properties, performance can degrade with respect to random sampling by up to 12.5% without diversity, while our framework achieves gains of up to 25%. Incorporating diversity into dataset construction not only preserves informativeness for the targeted properties, but also improves materials coverage for potential future objectives. As a result, the constructed datasets remain broadly informative across considered and unconsidered outcomes, ensuring unbiased quality entries and mitigating cold-start limitations in subsequent modeling and discovery campaigns.
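The abstract does not spell out the selection rule, so the following is only a minimal sketch of what diversity-aware selection could look like, not the authors' method. It assumes each candidate material has a feature vector and a scalar informativeness score for the targeted property, and it greedily trades that score off against distance to the already-selected set (a farthest-point heuristic). The function name `diversity_aware_select`, the `weight` parameter, and the toy data are all hypothetical.

```python
import numpy as np

def diversity_aware_select(features, scores, k, weight=0.5):
    """Greedily pick k candidate indices (illustrative heuristic only).

    weight=0 ranks purely by score; weight=1 ranks purely by distance
    to the nearest already-selected candidate.
    """
    features = np.asarray(features, dtype=float)
    scores = np.asarray(scores, dtype=float)
    n = len(features)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of candidates")

    # Rescale scores to [0, 1] so they are comparable with normalized distances.
    span = scores.max() - scores.min()
    norm_scores = (scores - scores.min()) / span if span > 0 else np.zeros(n)

    # Seed with the top-scoring candidate.
    selected = [int(np.argmax(norm_scores))]
    # min_dist[i] is the distance from candidate i to its nearest selected point.
    min_dist = np.linalg.norm(features - features[selected[0]], axis=1)

    while len(selected) < k:
        max_d = min_dist.max()
        norm_dist = min_dist / max_d if max_d > 0 else np.zeros(n)
        utility = (1 - weight) * norm_scores + weight * norm_dist
        utility[selected] = -np.inf  # never pick the same candidate twice
        nxt = int(np.argmax(utility))
        selected.append(nxt)
        # Update nearest-selected distances with the newly added point.
        new_dist = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected

# Toy data: three high-scoring candidates in one cluster, two low-scoring far away.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [10.0, 10.0], [10.1, 10.0]]
s = [1.0, 0.9, 0.8, 0.1, 0.0]
score_only = diversity_aware_select(X, s, k=2, weight=0.0)
balanced = diversity_aware_select(X, s, k=2, weight=0.5)
```

On this toy data, score-only selection stays inside the high-scoring cluster, while a nonzero `weight` pulls the second pick into the distant cluster. That is the coverage effect the abstract credits with preserving performance on untargeted properties.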