Apr 21, 2026 · arXiv:2604.18944

A Mechanism and Optimization Study on the Impact of Information Density on User-Generated Content Named Entity Recognition

Xiaobo Jiang, Dinghong Lai, Song Qiu, Yadong Deng, Xinkai Zhan

AI Summary

This paper investigates the performance collapse of NER models on User-Generated Content (UGC) and identifies low Information Density (ID) as a key causal factor, independent of other noise symptoms. The authors use hierarchical confounding-controlled resampling experiments and Attention Spectrum Analysis (ASA) to show that reduced ID leads to "attention blunting" and degraded NER performance. Based on these insights, they propose the Window-Aware Optimization Module (WOM), an LLM-empowered framework that uses selective back-translation to enhance semantic density in information-sparse regions, achieving state-of-the-art results on WNUT2017.

Key Contribution

NER performance on user-generated content is not just a matter of noise: it is fundamentally limited by information density, and targeted augmentation of information-sparse regions yields up to 4.5% absolute F1 improvement.

Abstract

Named Entity Recognition (NER) models trained on clean, high-resource corpora exhibit catastrophic performance collapse when deployed on noisy, sparse User-Generated Content (UGC), such as social media. Prior research has predominantly focused on point-wise symptom remediation, employing customized fine-tuning to address issues like neologisms, alias drift, non-standard orthography, long-tail entities, and class imbalance. However, these improvements often fail to generalize because they overlook the structural sparsity inherent in UGC. This study reveals that surface-level noise symptoms share a unified root cause: low Information Density (ID). Through hierarchical confounding-controlled resampling experiments (specifically controlling for entity rarity and annotation consistency), we identify ID as an independent key factor. We introduce Attention Spectrum Analysis (ASA) to quantify how reduced ID causally leads to "attention blunting," ultimately degrading NER performance. Informed by these mechanistic insights, we propose the Window-Aware Optimization Module (WOM), an LLM-empowered, model-agnostic framework. WOM identifies information-sparse regions and uses selective back-translation to directionally enhance semantic density without altering the model architecture. Deployed atop mainstream architectures on standard UGC datasets (WNUT2017, Twitter-NER, WNUT2016), WOM yields up to 4.5% absolute F1 improvement, demonstrating robustness and achieving new state-of-the-art (SOTA) results on WNUT2017.
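
The window-level idea in the abstract (find information-sparse regions, then rewrite only those) can be illustrated with a minimal sketch. This is not the authors' WOM implementation: the abstract does not define how ID is measured, so the density proxy here (share of non-filler tokens per window), the `FILLERS` list, the window size, the threshold, and the `rewrite` callback that stands in for LLM back-translation are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical filler/stopword list; the paper's actual ID measure is not
# specified in the abstract, so this is only a stand-in.
FILLERS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on",
           "it", "so", "i", "u", "lol", "omg", "like", "just"}


def window_density(tokens: List[str]) -> float:
    """Crude density proxy: fraction of tokens that are not fillers."""
    if not tokens:
        return 0.0
    content = [t for t in tokens if t.lower() not in FILLERS]
    return len(content) / len(tokens)


def sparse_windows(tokens: List[str], size: int = 4,
                   threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return (start, end) spans of non-overlapping windows below threshold."""
    spans = []
    for start in range(0, len(tokens), size):
        end = min(start + size, len(tokens))
        if window_density(tokens[start:end]) < threshold:
            spans.append((start, end))
    return spans


def enrich(tokens: List[str],
           rewrite: Callable[[List[str]], List[str]],
           size: int = 4, threshold: float = 0.5) -> List[str]:
    """Apply `rewrite` only to sparse windows; dense windows pass through."""
    out: List[str] = []
    for start in range(0, len(tokens), size):
        window = tokens[start:start + size]
        if window_density(window) < threshold:
            out.extend(rewrite(window))  # placeholder for back-translation
        else:
            out.extend(window)
    return out


if __name__ == "__main__":
    toks = "omg lol so like Justin Bieber concert tonight".split()
    print(sparse_windows(toks))
    print(enrich(toks, lambda w: ["[REWRITTEN]"]))
```

In this toy setup only the first window is flagged as sparse, so the entity-bearing tokens in the second window are left untouched; a real system would also need to realign entity labels after any rewrite, which the sketch does not attempt.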

Data Curation & Synthetic Data · Natural Language Processing · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 36

Year: 2026

Venue: N/A
