This paper introduces a temporal taxonomy of news-document trajectories to identify "anticipatory outliers": documents that precede and signal the emergence of new topics in dynamic topic modeling. The authors implement this taxonomy in a cumulative clustering setting using document embeddings from 11 language models and evaluate it on HydroNewsFr, a French news corpus about the hydrogen economy. Inter-model agreement identifies a small, high-confidence subset of these anticipatory outliers, which suggests that outliers can serve as early signals for topic discovery.
Outliers aren't just noise: some are early signals of entirely new topics, detectable by tracking document trajectories.
Outliers in dynamic topic modeling are typically treated as noise, yet we show that some can serve as early signals of emerging topics. We introduce a temporal taxonomy of news-document trajectories that defines how documents relate to topic formation over time. It distinguishes anticipatory outliers, which precede the topics they later join, from documents that either reinforce existing topics or remain isolated. By capturing these trajectories, the taxonomy links weak-signal detection with temporal topic modeling and clarifies how individual articles anticipate, initiate, or drift within evolving clusters. We implement it in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models and evaluate it retrospectively on HydroNewsFr, a French news corpus on the hydrogen economy. Inter-model agreement reveals a small, high-consensus subset of anticipatory outliers, increasing confidence in these labels. Qualitative case studies further illustrate these trajectories through concrete topic developments.
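The trajectory labels and the agreement step described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each document has one cluster label per time step, that `-1` marks an outlier (a common clustering convention), and that topic ids stay stable across steps. The function names, the `topic_birth` mapping, and the `min_models` threshold are hypothetical and chosen only for illustration.

```python
from collections import Counter

OUTLIER = -1  # assumed label for documents not assigned to any topic


def classify_trajectory(labels, arrival, topic_birth):
    """Label one document's trajectory (simplified, hypothetical rules).

    labels: cluster label of the document at each time step, starting at
        its arrival step.
    arrival: time step at which the document enters the corpus.
    topic_birth: dict mapping topic id -> time step at which it first exists.
    """
    if labels[0] != OUTLIER:
        # Assigned to a topic on arrival: it reinforces an existing topic.
        return "reinforcing"
    for label in labels[1:]:
        if label != OUTLIER:
            # Started as an outlier, later joined a topic. It is
            # anticipatory only if that topic formed after the document.
            if topic_birth[label] > arrival:
                return "anticipatory"
            return "reinforcing"
    return "isolated"  # never joined any topic


def consensus_anticipatory(per_model_labels, min_models):
    """Return documents labeled anticipatory by at least min_models models.

    per_model_labels: dict mapping model name -> {doc_id: trajectory label}.
    """
    votes = Counter(
        doc_id
        for doc_labels in per_model_labels.values()
        for doc_id, label in doc_labels.items()
        if label == "anticipatory"
    )
    return sorted(doc for doc, count in votes.items() if count >= min_models)


# Toy data: topic 0 exists from step 0, topic 1 forms at step 2.
topic_birth = {0: 0, 1: 2}
example = classify_trajectory([-1, -1, 1], arrival=0, topic_birth=topic_birth)

toy_models = {
    "model_a": {"d1": "anticipatory", "d2": "isolated"},
    "model_b": {"d1": "anticipatory", "d2": "anticipatory"},
    "model_c": {"d1": "anticipatory", "d2": "reinforcing"},
}
high_consensus = consensus_anticipatory(toy_models, min_models=3)
```

In this toy setup, a document that is an outlier for two steps and then joins a topic born at step 2 is labeled anticipatory, and only `d1` passes a unanimous three-model vote. Raising `min_models` shrinks the retained set, which mirrors the idea of a small, high-consensus subset.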