This paper introduces a revised annotation scheme for cross-document coreference resolution (CDCR) that treats coreference chains as discourse elements (DEs) to capture lexical diversity and framing variation, particularly in news coverage. The authors re-annotated the NewsWCL50 dataset and a subset of ECB+ using a unified codebook that accommodates both identity and near-identity relations between mentions. Evaluation with lexical diversity metrics and a same-head-lemma baseline shows that the re-annotated datasets exhibit a balanced level of lexical diversity, making them suitable for discourse-aware CDCR research.
Unlock richer insights from news analysis by embracing lexical diversity: a new cross-document coreference dataset links mentions like "the caravan," "asylum seekers," and "those contemplating illegal entry."
Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents, enabling content analysis that aggregates information at the level of discourse participants. However, existing datasets primarily focus on event resolution and employ a narrow definition of coreference, which limits their effectiveness in analyzing diverse and polarized news coverage where wording varies widely. This paper proposes a revised CDCR annotation scheme for the NewsWCL50 dataset, treating coreference chains as discourse elements (DEs) and conceptual units of analysis. The approach accommodates both identity and near-identity relations, e.g., by linking "the caravan" - "asylum seekers" - "those contemplating illegal entry", allowing models to capture lexical diversity and framing variation in media discourse while maintaining the fine-grained annotation of DEs. We reannotate NewsWCL50 and a subset of ECB+ using a unified codebook and evaluate the new datasets through lexical diversity metrics and a same-head-lemma baseline. The results show that the reannotated datasets align closely in lexical diversity, falling between the original ECB+ and NewsWCL50, thereby supporting balanced and discourse-aware CDCR research in the news domain.
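To make the evaluation idea concrete, the following is a minimal sketch of a same-head-lemma baseline and a simple lexical diversity count. It is not the authors' implementation: the helper names (`head_lemma`, `same_head_lemma_clusters`, `unique_head_lemmas`) are hypothetical, the head is approximated as the last token of a mention, lemmatization is a naive plural-stripping rule, and the fourth mention in the example chain is invented for illustration. A real setup would presumably use a dependency parser and a proper lemmatizer.

```python
from collections import defaultdict


def head_lemma(mention: str) -> str:
    """Crude stand-in for a head lemma (assumption, not the paper's method).

    Takes the last token as the head and strips a trailing plural "s".
    This is wrong for some phrases (e.g. the true head of
    "those contemplating illegal entry" is "those"), but it is enough
    to show how the baseline behaves.
    """
    head = mention.lower().split()[-1]
    if head.endswith("s") and len(head) > 3:
        head = head[:-1]
    return head


def same_head_lemma_clusters(mentions):
    """Baseline: put mentions in the same cluster iff their head lemmas match."""
    clusters = defaultdict(list)
    for mention in mentions:
        clusters[head_lemma(mention)].append(mention)
    return dict(clusters)


def unique_head_lemmas(chain):
    """Illustrative diversity measure: distinct head lemmas in one gold chain."""
    return len({head_lemma(mention) for mention in chain})


# One gold chain: the first three mentions come from the abstract,
# the fourth is a hypothetical addition for illustration.
chain = [
    "the caravan",
    "asylum seekers",
    "those contemplating illegal entry",
    "the migrant caravan",
]

clusters = same_head_lemma_clusters(chain)
for lemma, members in clusters.items():
    print(lemma, members)
print("unique head lemmas:", unique_head_lemmas(chain))
```

Under these assumptions, the baseline splits the single gold chain into several clusters, because the mentions share a referent but not a head word. The more lexically diverse a dataset's chains are, the worse such a baseline can be expected to perform, which is why it can serve as a rough indicator of lexical diversity.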