The authors introduce MedConclusion, a dataset of 5.7M PubMed structured abstracts, each pairing the non-conclusion sections with the author-written conclusion, designed to evaluate whether LLMs can infer scientific conclusions from biomedical evidence. They benchmark diverse LLMs under conclusion and summary prompting and score outputs with reference-based metrics and LLM-as-a-judge. Results show that conclusion writing is behaviorally distinct from summary writing, that current automatic metrics leave strong models closely clustered, and that the choice of LLM judge can substantially shift absolute scores.
Writing conclusions from structured biomedical evidence is behaviorally distinct from summarization, and current automatic metrics do not clearly separate strong LLMs on the task.
Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce MedConclusion, a large-scale dataset of 5.7M PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at: https://github.com/Harvard-AI-and-Robotics-Lab/MedConclusion.
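The evidence-to-conclusion pairing described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Instance` type, the `make_instance` helper, the `CONCLUSION_LABELS` set, and the "LABEL: text" evidence format are all hypothetical choices made for illustration, and the released code may define sections and filtering differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical set of section labels treated as the conclusion;
# the actual dataset may use a different rule.
CONCLUSION_LABELS = {"CONCLUSION", "CONCLUSIONS"}


@dataclass
class Instance:
    evidence: str    # non-conclusion sections, joined in abstract order
    conclusion: str  # author-written conclusion (the reference target)


def make_instance(sections: List[Tuple[str, str]]) -> Optional[Instance]:
    """Split a structured abstract into (evidence, conclusion).

    `sections` is a list of (label, text) pairs in abstract order.
    Returns None when either side would be empty.
    """
    evidence_parts = []
    conclusion_parts = []
    for label, text in sections:
        norm = label.strip().upper()
        if norm in CONCLUSION_LABELS:
            conclusion_parts.append(text.strip())
        else:
            evidence_parts.append(f"{norm}: {text.strip()}")
    if not evidence_parts or not conclusion_parts:
        return None
    return Instance("\n".join(evidence_parts), " ".join(conclusion_parts))
```

Under these assumptions, a model would receive `evidence` as input under either a conclusion or a summary prompt, and its output would be scored against `conclusion`.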