Majority re-sampling via sub-class clustering for imbalanced datasets

Shih Wen Ke, Chih Fong Tsai, Yi Ying Pan, Wei Chao Lin

Research output: Contribution to journal › Article › peer-review

2 Scopus citations


Many real-world datasets are class imbalanced: the number of samples in one class is much smaller than in the other classes. In the related literature, under- and over-sampling techniques are widely used to re-balance class-imbalanced datasets. However, each has a limitation: under-sampling risks removing representative majority class samples, and over-sampling risks overfitting because it generates a large number of synthetic minority class samples. Therefore, a novel approach, namely Majority Re-sampling via Sub-class Clustering (MRSC), is introduced. It uses a clustering algorithm to group the majority class data into several clusters, i.e. sub-classes. A new training set containing the multiple sub-classes and the minority class is then produced, and the classifier is trained on this new multi-class dataset, which has a lower imbalance ratio than the original dataset. Experimental results on 44 two-class imbalanced datasets show that MRSC combined with k-NN classifiers, both single and ensemble, significantly outperforms the other classifiers as well as seven state-of-the-art re-sampling approaches. Moreover, clustering with affinity propagation and with k-means produces very similar results, without significant differences in performance, which indicates the stability of MRSC.
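The relabeling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`kmeans_labels`, `mrsc_relabel`, `knn_predict`), the toy data, the fixed number of clusters, and the plain k-means and k-NN routines are all assumptions made for illustration. In particular, the abstract does not say how sub-class predictions are turned back into two-class predictions; this sketch assumes any majority sub-class simply maps to the majority label.

```python
import math
import random
from collections import Counter


def kmeans_labels(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (illustrative); returns a cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


def mrsc_relabel(X, y, minority, k):
    """Replace each label by (original_label, sub_class_id).

    Majority samples get the id of their k-means cluster; minority
    samples all keep sub-class id 0.
    """
    maj_idx = [i for i, label in enumerate(y) if label != minority]
    clusters = kmeans_labels([X[i] for i in maj_idx], k)
    sub_y = [(label, 0) for label in y]
    for i, c in zip(maj_idx, clusters):
        sub_y[i] = (y[i], c)
    return sub_y


def knn_predict(X, sub_y, x, n_neighbors=3):
    """k-NN vote over sub-class labels, mapped back to the original class.

    Assumption: a vote for any majority sub-class counts as 'majority'.
    """
    nearest = sorted(range(len(X)), key=lambda i: math.dist(X[i], x))
    votes = Counter(sub_y[i] for i in nearest[:n_neighbors])
    winner = votes.most_common(1)[0][0]
    return winner[0]


# Hypothetical toy data: 12 majority samples in two groups, 3 minority samples.
maj_a = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (0.2, 0.8)]
maj_b = [(10, 0), (10, 1), (11, 0), (11, 1), (10.5, 0.5), (10.2, 0.8)]
mino = [(5, 5), (5, 6), (6, 5)]
X = maj_a + maj_b + mino
y = ["maj"] * 12 + ["min"] * 3

sub_y = mrsc_relabel(X, y, minority="min", k=2)
print(Counter(sub_y))                      # sub-class sizes after relabeling
print(knn_predict(X, sub_y, (5.4, 5.2)))   # query near the minority samples
```

Splitting the majority class lowers the imbalance ratio because each majority sub-class is compared with the minority class separately, and no sample is discarded or synthesized. The number of clusters is fixed by hand here; affinity propagation, which the abstract also mentions, would choose it from the data.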


  • clustering
  • data mining
  • imbalanced datasets
  • machine learning
  • under-sampling

