Mar 3, 2026 | arXiv:2603.03126

The Science Data Lake: A Unified Open Infrastructure Integrating 293 Million Papers Across Eight Scholarly Sources with Embedding-Based Ontology Alignment

AI Summary

The Science Data Lake is a unified, locally-deployable infrastructure built on DuckDB and Parquet files that integrates eight open scholarly sources via DOI normalization. An embedding-based ontology alignment using BGE-large sentence embeddings maps OpenAlex topics to 13 scientific ontologies, covering 99.8% of topics at a similarity threshold of 0.65 and reaching an F1 score of 0.77 at the recommended 0.85 threshold, where it outperforms TF-IDF, BM25, and Jaro-Winkler baselines. The resource, comprising ~293 million papers, is validated through automated checks, citation agreement analysis, and manual annotation, and supports cross-source analyses that are infeasible with any single database.
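The DOI normalization that links the sources can be illustrated with a minimal Python sketch. The exact rules used by the Science Data Lake are not stated on this page, so the prefix list, the validity check, and the names `normalize_doi` and `join_on_doi` are assumptions made for illustration only.

```python
import re

# Assumed prefixes to strip; the paper's actual normalization rules are not given here.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(raw):
    """Return a lowercase bare DOI, or None if the value does not look like one."""
    if raw is None:
        return None
    doi = _DOI_PREFIX.sub("", raw.strip()).lower()
    # Loose sanity check (an assumption): DOIs start with "10." and contain a slash.
    return doi if doi.startswith("10.") and "/" in doi else None


def join_on_doi(left, right):
    """Inner-join two lists of dict records on their normalized DOI."""
    index = {}
    for rec in right:
        key = normalize_doi(rec.get("doi"))
        if key is not None:
            index[key] = rec
    joined = []
    for rec in left:
        key = normalize_doi(rec.get("doi"))
        if key in index:
            joined.append((key, rec, index[key]))
    return joined


# Hypothetical records from two sources that spell the same DOI differently.
source_a = [{"doi": "10.1/X", "title": "Example paper"}]
source_b = [{"doi": "https://dx.doi.org/10.1/x", "citations": 3}]
matches = join_on_doi(source_a, source_b)
```

Because each record keeps its own fields and only the join key is normalized, this mirrors the idea of unifying sources while preserving source-level schemas.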

Key Contribution

A unified, open-source "Science Data Lake" makes approximately 293 million papers from eight sources queryable with SQL, and adds an embedding-based ontology alignment that links OpenAlex topics to 13 scientific ontologies.

Abstract

Scholarly data are largely fragmented across siloed databases with divergent metadata and missing linkages among them. We present the Science Data Lake, a locally-deployable infrastructure built on DuckDB and simple Parquet files that unifies eight open sources (Semantic Scholar, OpenAlex, SciSciNet, Papers with Code, Retraction Watch, Reliance on Science, a preprint-to-published mapping, and Crossref) via DOI normalization while preserving source-level schemas. The resource comprises approximately 960 GB of Parquet files spanning ~293 million uniquely identifiable papers across ~22 schemas and ~153 SQL views. An embedding-based ontology alignment using BGE-large sentence embeddings maps 4,516 OpenAlex topics to 13 scientific ontologies (~1.3 million terms), yielding 16,150 mappings covering 99.8% of topics ($\geq 0.65$ threshold) with $F1 = 0.77$ at the recommended $\geq 0.85$ operating point, outperforming TF-IDF, BM25, and Jaro-Winkler baselines on a 300-pair gold-standard evaluation. We validate through 10 automated checks, cross-source citation agreement analysis (pairwise Pearson $r = 0.76$ to $0.87$), and stratified manual annotation. Four vignettes demonstrate cross-source analyses infeasible with any single database. The resource is open source, deployable on a single drive or queryable remotely via HuggingFace, and includes structured documentation suitable for large language model (LLM) based research agents.
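The thresholded alignment and its F1 evaluation can be sketched as follows. This is a toy illustration, not the authors' pipeline: the three-dimensional vectors stand in for BGE-large embeddings, the rule of keeping every term at or above the threshold is an assumption, and the names `cosine`, `align`, and `f1_score` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(topics, terms, threshold):
    """Keep every (topic, term) pair whose cosine similarity is >= threshold."""
    mappings = []
    for topic, t_vec in topics.items():
        for term, o_vec in terms.items():
            score = cosine(t_vec, o_vec)
            if score >= threshold:
                mappings.append((topic, term, score))
    return mappings


def f1_score(predicted, gold):
    """F1 of predicted (topic, term) pairs against a gold-standard set of pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy stand-ins for sentence embeddings (real BGE-large vectors are much larger).
topics = {"Machine learning": [1.0, 0.0, 0.0]}
terms = {
    "machine learning": [0.9, 0.1, 0.0],
    "statistics": [0.7, 0.7, 0.0],
    "botany": [0.0, 0.0, 1.0],
}

broad = align(topics, terms, threshold=0.65)   # coverage-oriented threshold
strict = align(topics, terms, threshold=0.85)  # recommended operating point

gold = {("Machine learning", "machine learning")}
broad_f1 = f1_score([(t, o) for t, o, _ in broad], gold)
strict_f1 = f1_score([(t, o) for t, o, _ in strict], gold)
```

In this toy setup the lower threshold admits a looser match and the higher one keeps only the closest term, which illustrates why coverage is quoted at one threshold and F1 at another.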

Data Curation & Synthetic Data

Recommendation & Information Retrieval

Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 34

Year: 2026

Venue: N/A
