The authors introduce ClaimFlow, a manually annotated dataset of 1,084 claims and 832 cross-paper relations extracted from 304 NLP papers in the ACL Anthology (1979-2025), built to capture explicitly how scientific claims evolve. They define a new task, Claim Relation Classification, which requires models to infer the scientific stance toward a cited claim, and report a baseline of 0.78 macro-F1. Applying their model to ~13k NLP papers, they find that most claims are never reused, and that widely propagated claims are more often reshaped than directly confirmed or refuted.
Most scientific claims in NLP die in obscurity, and even the survivors are more likely to be subtly reshaped than outright validated or debunked.
Scientific papers do more than report results: they advance $\textit{claims}$ that later work supports, extends, or sometimes refutes. Yet existing methods for citation and claim analysis capture only fragments of this dialogue. In this work, we make these interactions explicit at the level of individual scientific claims. We introduce $\texttt{ClaimFlow}$, a claim-centric view of the NLP literature, built from $304$ ACL Anthology papers (1979$-$2025) that are manually annotated with $1{,}084$ claims and $832$ cross-paper claim relations, indicating whether a citing paper $\textit{supports}$, $\textit{extends}$, $\textit{qualifies}$, $\textit{refutes}$, or references a claim as $\textit{background}$. Using $\texttt{ClaimFlow}$, we define a new task, $\textit{Claim Relation Classification}$, which requires models to infer the scientific stance toward a cited claim from the text and citation context. Evaluating strong neural models and large language models on this task, we report baseline performance of $0.78$ macro-F1, showing that claim-relation classification is feasible but challenging. We further apply our model to $\sim$$13k$ NLP papers to analyze how claims evolve across decades of NLP research. Our analysis reveals that $63.5$% of claims are never reused and only $11.1$% are ever challenged; meanwhile, widely propagated claims are more often $\textit{reshaped}$ through qualification and extension than directly confirmed or refuted. Overall, $\texttt{ClaimFlow}$ offers a lens for examining how ideas shift and mature within NLP, and a foundation for assessing whether models can interpret scientific argumentation.
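To make the task format concrete, here is a minimal sketch of what a Claim Relation Classification instance might look like, paired with a toy cue-phrase baseline over the five relation labels. The instance fields, cue phrases, and heuristic are illustrative assumptions, not the paper's actual data schema or models:

```python
from dataclasses import dataclass

# The five relation labels defined in the abstract.
RELATIONS = ["supports", "extends", "qualifies", "refutes", "background"]

@dataclass
class ClaimRelationInstance:
    cited_claim: str       # claim text from the cited paper (hypothetical field name)
    citation_context: str  # text around the citation in the citing paper

def keyword_baseline(inst: ClaimRelationInstance) -> str:
    """Toy heuristic: map cue phrases in the citation context to a relation.

    This is NOT the paper's model; it only illustrates the input/output shape
    of the task. Cue lists are invented for the example.
    """
    ctx = inst.citation_context.lower()
    cues = [
        ("refutes",   ["contrary to", "we find no evidence", "does not hold"]),
        ("qualifies", ["only when", "however", "under certain"]),
        ("extends",   ["we extend", "building on", "we generalize"]),
        ("supports",  ["consistent with", "confirming", "in line with"]),
    ]
    for label, phrases in cues:
        if any(p in ctx for p in phrases):
            return label
    return "background"  # default: citation with no explicit stance cue

example = ClaimRelationInstance(
    cited_claim="Subword tokenization improves rare-word translation.",
    citation_context="Building on this result, we generalize subword units to character n-grams.",
)
print(keyword_baseline(example))  # "extends"
```

A real baseline would replace the cue lists with a fine-tuned classifier or an LLM prompt over the claim and its citation context, but the label set and instance structure stay the same.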