Abstract
Many real-world datasets are class imbalanced: the number of samples in one class is much smaller than in the other classes. In the related literature, under- and over-sampling are widely used to re-balance such datasets. However, under-sampling risks removing representative majority class samples, and over-sampling can cause overfitting by generating a large number of synthetic minority class samples. Therefore, a novel approach, namely Majority Re-sampling via Sub-class Clustering (MRSC), is introduced. It uses a clustering algorithm to group the majority class data into several clusters, i.e. sub-classes. A new training set containing these sub-classes and the minority class is then produced, and the classifier is trained on this multi-class dataset, which has a lower imbalance ratio than the original dataset. Experimental results on 44 two-class imbalanced datasets show that MRSC combined with k-NN classifiers, both single and ensemble, significantly outperforms the other classifiers as well as seven state-of-the-art re-sampling approaches. Moreover, clustering with affinity propagation and with k-means produces very similar results, without significant differences in performance, which indicates the stability of MRSC.
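
The relabelling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`kmeans`, `mrsc_relabel`, `knn_predict`, `mrsc_predict`), the toy data, the fixed choice of two clusters, and the rule that any predicted sub-class is mapped back to the majority class are all assumptions made for illustration. The imbalance ratio is taken here as the largest class size divided by the minority class size.

```python
import math
import random
from collections import Counter

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return assign

def mrsc_relabel(X, y, minority_label, k):
    """Replace the majority label with k sub-class labels (maj_0, maj_1, ...)."""
    maj_idx = [i for i, label in enumerate(y) if label != minority_label]
    clusters = kmeans([X[i] for i in maj_idx], k)
    new_y = list(y)
    for i, c in zip(maj_idx, clusters):
        new_y[i] = f"maj_{c}"
    return new_y

def knn_predict(X_train, y_train, x, n_neighbors=3):
    """Majority vote among the n_neighbors closest training points."""
    nearest = sorted(range(len(X_train)),
                     key=lambda i: math.dist(x, X_train[i]))[:n_neighbors]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

def mrsc_predict(X_train, sub_y, x, minority_label, majority_label):
    """Assumed decision rule: any sub-class prediction counts as majority."""
    pred = knn_predict(X_train, sub_y, x)
    return minority_label if pred == minority_label else majority_label

def imbalance_ratio(labels, minority_label):
    """Largest class size divided by the minority class size."""
    counts = Counter(labels)
    return max(counts.values()) / counts[minority_label]

# Toy two-class data: 60 majority points in two blobs, 6 minority points.
rng = random.Random(42)
majority = ([(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(30)]
            + [(rng.gauss(6, 0.3), rng.gauss(6, 0.3)) for _ in range(30)])
minority = [(rng.gauss(3, 0.3), rng.gauss(3, 0.3)) for _ in range(6)]
X = majority + minority
y = ["maj"] * len(majority) + ["min"] * len(minority)

sub_y = mrsc_relabel(X, y, minority_label="min", k=2)
ratio_before = imbalance_ratio(y, "min")
ratio_after = imbalance_ratio(sub_y, "min")
pred_minority = mrsc_predict(X, sub_y, (3.0, 3.0), "min", "maj")
pred_majority = mrsc_predict(X, sub_y, (0.0, 0.0), "min", "maj")
```

Comparing `ratio_before` with `ratio_after` shows the effect the abstract relies on: no majority sample is discarded and no synthetic minority sample is created, yet each sub-class is no larger than the original majority class, so the ratio against the minority class cannot increase.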
Original language | English |
---|---|
Journal | Journal of Experimental and Theoretical Artificial Intelligence |
DOIs | |
State | Accepted/In press - 2023 |
Keywords
- clustering
- data mining
- imbalanced datasets
- machine learning
- under-sampling