The paper introduces a system for predicting semantic relatedness between English sentences using a fine-tuned XLM-RoBERTa-large model combined with a CNN architecture and k-fold cross-validation. This approach was evaluated on the SemRel2024 dataset, a benchmark for semantic textual relatedness. The system achieved state-of-the-art performance, with a Spearman's correlation of 0.854 and a Pearson's correlation of 0.863, demonstrating significant improvements over other transformer-based models and traditional machine learning approaches.
Fine-tuning XLM-RoBERTa-large with a CNN and k-fold cross-validation achieves state-of-the-art semantic relatedness prediction, surpassing other transformers and traditional methods on the SemRel2024 benchmark.
Accurately measuring the semantic relatedness between sentences is crucial for various natural language processing (NLP) tasks, including question answering, text summarization, and information retrieval. This study introduces a system designed to precisely evaluate the relatedness of English sentences. Utilizing the SemRel2024 dataset, a comprehensive benchmark for semantic textual relatedness (STR), we conducted baseline experiments across multiple monolingual settings. Our proposed method integrates k-fold cross-validation with a fine-tuned XLM-RoBERTa-large model and a convolutional neural network (CNN) architecture, achieving the highest Spearman's correlation of 0.854 and Pearson's correlation of 0.863. We also explored several transformer-based models (RoBERTa-base, RoBERTa-large, BERT-large) and their architectural variations, as well as the effects of attention mechanisms combined with word embeddings such as Word2Vec, FastText, and global vectors (GloVe). Among traditional machine learning models, the TF-IDF + random forest (RF) model exhibited the best performance. Our findings demonstrate the potential for significant advancements in NLP applications through enhanced semantic relatedness prediction, thereby improving machine translation, information retrieval, and the development of sophisticated language models capable of nuanced understanding.
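The system is evaluated with Spearman's and Pearson's correlations between predicted and gold relatedness scores. As a minimal, self-contained sketch of these two metrics (not the paper's own evaluation code, which is not shown), Pearson's r measures linear correlation on the raw scores, while Spearman's rho is Pearson's r computed on the ranks of the scores, with ties assigned average ranks:

```python
from statistics import mean

def pearson(x, y):
    """Pearson's r: covariance of x and y divided by the product of their standard deviations."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def average_ranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the tied 1-based positions
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson's r applied to the rank-transformed scores."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, predictions that preserve the gold ordering exactly yield a Spearman's rho of 1.0 even when the raw values differ, which is why STR evaluations report both metrics.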