The Science Data Lake is a unified, locally deployable infrastructure built on DuckDB and Parquet files that integrates eight open scholarly sources via DOI normalization. An embedding-based ontology alignment using BGE-large sentence embeddings maps OpenAlex topics to 13 scientific ontologies, covering 99.8% of topics at the 0.65 threshold and reaching an F1 score of 0.77 at the recommended 0.85 operating point, ahead of the TF-IDF, BM25, and Jaro-Winkler baselines. The resource, comprising ~293 million papers, is validated through automated checks, citation agreement analysis, and manual annotation, and it enables cross-source analyses that are infeasible with any single database.
Finally, a unified, open-source "Science Data Lake" lets you query 293 million papers across eight sources with a single SQL query, complete with embedding-based ontology alignment for semantic search.
Scholarly data are largely fragmented across siloed databases with divergent metadata and missing linkages among them. We present the Science Data Lake, a locally deployable infrastructure built on DuckDB and simple Parquet files that unifies eight open sources (Semantic Scholar, OpenAlex, SciSciNet, Papers with Code, Retraction Watch, Reliance on Science, a preprint-to-published mapping, and Crossref) via DOI normalization while preserving source-level schemas. The resource comprises approximately 960 GB of Parquet files spanning ~293 million uniquely identifiable papers across ~22 schemas and ~153 SQL views. An embedding-based ontology alignment using BGE-large sentence embeddings maps 4,516 OpenAlex topics to 13 scientific ontologies (~1.3 million terms), yielding 16,150 mappings covering 99.8% of topics (at the $\geq 0.65$ threshold) with $F_1 = 0.77$ at the recommended $\geq 0.85$ operating point, outperforming TF-IDF, BM25, and Jaro-Winkler baselines on a 300-pair gold-standard evaluation. We validate the resource through 10 automated checks, cross-source citation agreement analysis (pairwise Pearson $r = 0.76$ to $0.87$), and stratified manual annotation. Four vignettes demonstrate cross-source analyses infeasible with any single database. The resource is open source, deployable on a single drive or queryable remotely via HuggingFace, and includes structured documentation suitable for large language model (LLM) based research agents.
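To make the linking idea concrete, here is a minimal Python sketch of how DOI normalization can join records from different sources. The abstract does not spell out the actual normalization rules, so everything here is an assumption for illustration: the stripped prefixes, the validity pattern (modeled on the commonly used `10.<registrant>/<suffix>` shape), the helper names `normalize_doi` and `link_by_doi`, and the toy records.

```python
import re

# Assumed URL/label prefixes to strip; the real pipeline's rules are not given in the abstract.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)
# Assumed validity check: "10." + 4-9 digit registrant code + "/" + non-empty suffix.
_DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_doi(raw):
    """Return a canonical lowercase DOI, or None if the value does not look like one."""
    if raw is None:
        return None
    # DOIs are case-insensitive, so lowercasing gives a stable join key.
    doi = _DOI_PREFIX.sub("", raw.strip()).lower()
    return doi if _DOI_SHAPE.match(doi) else None


def link_by_doi(*sources):
    """Group source-local record IDs under their shared normalized DOI.

    Each source is a (name, records) pair, where records is a list of
    (record_id, raw_doi) tuples. Records without a usable DOI are skipped.
    """
    linked = {}
    for name, records in sources:
        for record_id, raw_doi in records:
            doi = normalize_doi(raw_doi)
            if doi is not None:
                linked.setdefault(doi, {})[name] = record_id
    return linked


# Hypothetical toy records: (source-local ID, DOI string as that source stores it).
openalex = [("W1", "https://doi.org/10.1000/ABC123"), ("W2", None)]
semantic_scholar = [("s2-9", "doi:10.1000/abc123"), ("s2-7", "10.5555/xyz.42")]

linked = link_by_doi(("openalex", openalex), ("semantic_scholar", semantic_scholar))
for doi, ids in sorted(linked.items()):
    print(doi, ids)
```

Keeping the normalized DOI as a separate join key, rather than rewriting each source's own columns, is in line with the stated goal of preserving source-level schemas. In the actual resource the equivalent join would presumably be expressed as SQL views over the Parquet files.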