This paper investigates the performance collapse of NER models on User-Generated Content (UGC) and identifies low Information Density (ID) as a key causal factor, independent of other noise symptoms. The authors use hierarchical confounding-controlled resampling experiments and Attention Spectrum Analysis (ASA) to show that reduced ID leads to "attention blunting" and degraded NER performance. Based on these insights, they propose the Window-Aware Optimization Module (WOM), an LLM-empowered framework that uses selective back-translation to enhance semantic density in information-sparse regions, achieving state-of-the-art results on WNUT2017.
NER performance on user-generated content isn't just about noise – it's fundamentally limited by information density, and targeted augmentation can unlock significant gains.
Named Entity Recognition (NER) models trained on clean, high-resource corpora exhibit catastrophic performance collapse when deployed on noisy, sparse User-Generated Content (UGC), such as social media. Prior research has predominantly focused on point-wise symptom remediation, employing customized fine-tuning to address issues like neologisms, alias drift, non-standard orthography, long-tail entities, and class imbalance. However, these improvements often fail to generalize because they overlook the structural sparsity inherent in UGC. This study reveals that surface-level noise symptoms share a unified root cause: low Information Density (ID). Through hierarchical confounding-controlled resampling experiments (specifically controlling for entity rarity and annotation consistency), this paper identifies ID as an independent key factor. We introduce Attention Spectrum Analysis (ASA) to quantify how reduced ID causally leads to "attention blunting," ultimately degrading NER performance. Informed by these mechanistic insights, we propose the Window-Aware Optimization Module (WOM), an LLM-empowered, model-agnostic framework. WOM identifies information-sparse regions and utilizes selective back-translation to directionally enhance semantic density without altering model architecture. Deployed atop mainstream architectures on standard UGC datasets (WNUT2017, Twitter-NER, WNUT2016), WOM yields up to 4.5% absolute F1 improvement, demonstrating robustness and achieving new state-of-the-art (SOTA) results on WNUT2017.
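
To make the idea of "identifying information-sparse regions" concrete, the following is a minimal sketch of a sliding-window scan over a tokenized post. It is not the paper's method: the abstract does not define how ID is computed, so the density proxy here (the share of tokens that are not fillers or stopwords), the `FILLERS` list, the window size, and the threshold are all illustrative assumptions. The back-translation step that WOM applies to flagged regions is not shown.

```python
from typing import List, Tuple

# Hypothetical filler/stopword list, chosen only for illustration.
FILLERS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on",
           "so", "lol", "omg", "u", "im", "just", "like"}


def window_density(tokens: List[str]) -> float:
    """Assumed proxy for information density: fraction of non-filler tokens."""
    if not tokens:
        return 0.0
    content = [t for t in tokens if t.lower() not in FILLERS]
    return len(content) / len(tokens)


def sparse_windows(tokens: List[str], size: int = 4,
                   threshold: float = 0.5) -> List[Tuple[int, int, float]]:
    """Slide a window one token at a time and return (start, end, density)
    for every window whose density falls strictly below the threshold."""
    flagged = []
    last_start = max(len(tokens) - size, 0)
    for start in range(last_start + 1):
        window = tokens[start:start + size]
        density = window_density(window)
        if density < threshold:
            flagged.append((start, start + len(window), density))
    return flagged


if __name__ == "__main__":
    post = "omg lol so like just saw Beyonce in Houston".split()
    for start, end, density in sparse_windows(post):
        print(start, end, post[start:end], density)
```

In a pipeline of the kind the abstract describes, the flagged spans would be the candidates passed to a selective rewriting step, while denser spans (here, the part containing the entity mentions) would be left untouched.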