Aristotle University of Thessaloniki, Chongqing, NJU, Soochow, ZJU
Jun 4, 2026
arXiv:2606.05927

Addressing Imbalance in Multi-Label Data via Label-Specific Distance-based Oversampling

Bin Liu, Jun Wu, Haoyu Peng, Ao Zhou, Jin Wang, QiaoSong Chen, Grigorios Tsoumakas

AI Summary

This paper introduces Label-Specific Distance-based Multi-Label Oversampling (LSDMLO), a novel method designed to address the challenges of imbalanced label distributions in multi-label classification. By leveraging label-specific distances to identify semantically relevant neighbors, LSDMLO generates synthetic instances that maintain label consistency and reduce overfitting. Experimental results demonstrate that LSDMLO significantly outperforms existing oversampling techniques across various classifiers, highlighting its effectiveness in improving classification performance on imbalanced datasets.

Key Contribution

LSDMLO shows that tailoring synthetic instance generation to the features most relevant to each label improves multi-label classification performance on imbalanced data.

Abstract

The complex, imbalanced label distribution poses a crucial challenge to multi-label classification, as most classifiers are biased towards majority classes and high-frequency labels. Oversampling is an efficient and flexible solution that augments instances to provide a more balanced training dataset for multi-label classifiers. Most existing oversampling methods create synthetic instances heuristically, relying on neighborhood information retrieved using Euclidean distance over the entire feature space. However, they fail to consider the varying semantic relevance of features to different labels, which leads to label inconsistency among proximate neighbors and, in turn, to label confusion and overfitting to synthetic instances. To overcome this issue, we propose a novel sampling approach called Label-Specific Distance-based Multi-Label Oversampling (LSDMLO), which creates more useful and well-labeled synthetic instances to address the imbalance in multi-label datasets. LSDMLO derives a label-specific distance to identify label-consistent neighbors in a weighted, label-pertinent feature space, which facilitates selecting seed instances that express more label correlations in boundary areas and generating synthetic instances aligned with the label distribution of the original data. Comprehensive experiments verify that LSDMLO outperforms state-of-the-art multi-label sampling approaches under various base classifiers.
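To make the idea of a label-specific distance concrete, the following is a minimal sketch and not the authors' algorithm. The abstract does not state how feature weights are derived, how seeds are selected, or how labels are assigned to synthetic instances, so everything here is an assumption for illustration: weights are taken as the normalized absolute correlation between each feature and one binary label, neighbors are ranked by the resulting weighted Euclidean distance, and a synthetic instance is a SMOTE-style interpolation that simply copies the seed's label set. The function names `label_feature_weights`, `label_specific_neighbors`, and `oversample_label` are hypothetical.

```python
import numpy as np


def label_feature_weights(X, y, eps=1e-12):
    """Assumed weighting: |Pearson correlation| of each feature with one
    binary label, normalized to sum to 1 (not taken from the paper)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + eps
    w = np.abs(Xc.T @ yc) / denom
    total = w.sum()
    if total == 0:
        # Fall back to uniform weights (plain Euclidean distance up to scale).
        return np.full(X.shape[1], 1.0 / X.shape[1])
    return w / total


def label_specific_neighbors(X, w, seed_idx, k):
    """Indices of the k nearest instances to the seed under the
    feature-weighted Euclidean distance; the seed itself is excluded."""
    diff = X - X[seed_idx]
    d = np.sqrt((w * diff ** 2).sum(axis=1))
    d[seed_idx] = np.inf
    return np.argsort(d)[:k]


def oversample_label(X, Y, label, n_new, k=3, rng=None):
    """Generate n_new synthetic instances for one minority label by
    interpolating between a seed and a label-specific neighbor."""
    rng = np.random.default_rng(rng)
    w = label_feature_weights(X, Y[:, label].astype(float))
    minority = np.flatnonzero(Y[:, label] == 1)
    X_new, Y_new = [], []
    for _ in range(n_new):
        s = rng.choice(minority)
        r = rng.choice(label_specific_neighbors(X, w, s, k))
        lam = rng.random()  # interpolation factor in [0, 1)
        X_new.append(X[s] + lam * (X[r] - X[s]))
        # Simplification (assumption): copy the seed's full label set.
        Y_new.append(Y[s].copy())
    return np.array(X_new), np.array(Y_new)


# Toy data: feature 0 tracks label 0, feature 1 is mostly noise for it.
X = np.array([[0.0, 5.0], [0.1, -3.0], [0.2, 4.0],
              [1.0, -4.0], [1.1, 5.0], [1.2, -3.5]])
Y = np.array([[0, 1], [0, 0], [0, 1], [1, 0], [1, 1], [1, 0]])

w0 = label_feature_weights(X, Y[:, 0].astype(float))
X_syn, Y_syn = oversample_label(X, Y, label=0, n_new=4, k=2, rng=0)
```

In this toy setup the weight for feature 0 is larger than the weight for feature 1, so neighbors for label 0 are chosen mainly by closeness along feature 0. A per-label weight vector is the simplest way to express "a different distance for each label"; the paper's actual weighting and seed-selection scheme may differ.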

Data Curation & Synthetic Data

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
