Search papers, labs, and topics across Lattice.
This paper introduces Compass, an expert-guided LLM agent framework using a Knowledge Tree to extract marine lead (Pb) data from unstructured academic papers. Compass decomposes complex extraction tasks into verifiable steps, ensuring scientific validity without fine-tuning. Applied to 230,000 papers, Compass extracted 3,751 previously unincorporated Pb records with 92% accuracy, creating the largest integrated marine Pb database and expanding coverage in under-sampled regions.
Forget fine-tuning: expert-guided LLM agents can unlock vast troves of scientific data buried in unstructured papers with surprising accuracy.
Marine lead (Pb) and its isotopes are critical tracers for ocean circulation and anthropogenic pollution, yet in-situ observations remain costly and sparse. While vast historical records exist, they lie buried within the unstructured content of academic papers, creating"data silos"inaccessible to comprehensive analysis. Manual extraction is unscalable, while general-purpose Large Language Models (LLMs) lack the necessary domain-specific knowledge, leading to hallucinations and scientifically invalid outputs. To address this, we introduce an expert-guided adaptation approach that enables LLMs to perform rigorous scientific data extraction without fine-tuning. We operationalize this approach through Compass, an LLM agent framework enhanced by a Knowledge Tree co-designed with marine scientists, which decomposes complex tasks into verifiable steps, guiding the agent's reasoning to ensure scientific validity. Deploying Compass across a corpus of over 230,000 relevant open-access papers, we successfully extract 3,751 previously unincorporated Pb records. This effort establishes the largest integrated marine Pb database to date. Beyond standard metrics, Compass demonstrates superior reliability through multi-layered validation, achieving 92% accuracy as confirmed through expert manual verification. The newly integrated data expand coverage in previously under-sampled regions such as the East China Sea and the Southern Ocean, providing an enriched data foundation for future scientific discoveries. We release an interactive visualization platform to facilitate open scientific access. Our work demonstrates that expert-guided agents can effectively bridge the gap between general-purpose LLMs and high-stakes scientific domains, enabling scalable data discovery in geosciences.