This paper explores NLP techniques within corpus linguistics, focusing on automatic summarization and machine translation to address challenges like information overload and cross-language communication. The authors employ extractive and generative summarization methods, including Seq2Seq models and GANs, and train a Transformer-based neural machine translation (NMT) model on a multi-language parallel corpus. The key result is that the Transformer-based model significantly outperforms traditional methods, achieving a maximum information retention score of 88.76 in summarization tasks.
In the digital age, the explosive growth of information has created significant challenges: the burden of reading long texts, cross-language communication barriers, and information overload. To address these issues, this paper investigates the application and development of natural language processing (NLP) algorithm systems in corpus linguistics. We employ two methods for automatic summary generation, extractive and generative: the former identifies key sentences or phrases, while the latter uses Seq2Seq models to produce natural, fluent summaries. For machine translation, we build a multi-language parallel corpus and train a neural machine translation (NMT) model using statistical techniques and the Transformer architecture with attention mechanisms, which improves translation accuracy and fluency. To further combat information overload, we improve summary relevance and accuracy through multi-task learning and generative adversarial networks (GANs), and our Re3Sum model guides summary generation using real summaries as soft templates. Experimental results show that the Transformer-based model significantly outperforms traditional methods in accuracy, fluency, and information retention, achieving a maximum information retention score of 88.76. These gains reduce the burden of reading long texts and enhance cross-language communication efficiency. Overall, this study not only offers new research directions for the NLP field but also provides practical solutions to language processing challenges.
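The extractive step described above — identifying key sentences before a neural model generates the final summary — can be illustrated with a toy, non-neural sketch. This is not the paper's actual pipeline (which uses Seq2Seq models, Transformers, and GANs); it only shows the underlying idea of scoring sentences by the salience of their words. The function name and scoring heuristic here are illustrative assumptions, not taken from the paper.

```python
# Toy sketch of extractive summarization: score each sentence by the
# average corpus frequency of its words, then keep the top-k sentences
# in their original order. The paper's real system replaces this
# heuristic with learned neural models.
import re
from collections import Counter

def extractive_summary(text: str, k: int = 1) -> str:
    """Return the k highest-scoring sentences, in original order."""
    # Split on sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text serve as salience weights.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        toks = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    top = set(sorted(sentences, key=score, reverse=True)[:k])
    return " ".join(s for s in sentences if s in top)
```

For example, in a short passage where one sentence concentrates the most frequent content words, that sentence is selected as the one-sentence summary; a generative model would then rewrite such selected content into fluent prose.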