Towards hybrid over- and under-sampling combination methods for class imbalanced datasets: an experimental study

Cian Lin, Chih Fong Tsai, Wei Chao Lin

Research output: Contribution to journal › Article › peer-review

Abstract

The skewed class distributions of many class-imbalanced domain datasets often make it difficult for machine learning techniques to construct effective models. In such cases, data re-sampling techniques, such as under-sampling the majority class and over-sampling the minority class, are usually employed. In the related literature, some studies have shown that hybrid combinations of under- and over-sampling methods in different orders can produce better results. However, each study draws its final conclusion from a comparison with either under- or over-sampling methods alone. Therefore, the research objective of this paper is to find out which order of combining under- and over-sampling methods performs better. Experiments are conducted on 44 different domain datasets using three over-sampling algorithms (SMOTE, CTGAN, and TAN) and three under-sampling (i.e. instance selection) algorithms (IB3, DROP3, and GA). The results show that if the under-sampling algorithm is chosen carefully, i.e. IB3, no significant performance improvement is obtained by adding the over-sampling step. Furthermore, with the IB3 algorithm, it is better to perform instance selection first and over-sampling second than to use the reverse order; this order allows the random forest classifier to achieve the highest AUC.
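To make the two combination orders concrete, the following is a minimal sketch and not the authors' implementation. It assumes random under-sampling of the majority class as a stand-in for IB3, DROP3, or GA, and a simplified SMOTE-style interpolation as the over-sampler. The function names, the `ratio` parameter, and the toy data are all hypothetical. The sketch only shows that the two orders yield different training sets; it says nothing about which order classifies better.

```python
import random

def under_sample(majority, minority, ratio, rng):
    """Keep a random fraction `ratio` of the majority class.
    Assumption: a crude stand-in for instance selection (IB3, DROP3, GA)."""
    kept = rng.sample(majority, int(len(majority) * ratio))
    return kept, minority

def over_sample(majority, minority, rng, k=3):
    """Add SMOTE-style synthetic minority points until the classes are equal in size.
    Requires at least two minority points."""
    synthetic = []
    for _ in range(max(0, len(majority) - len(minority))):
        base = rng.choice(minority)
        # Rank the other original minority points by squared distance to `base`.
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        neighbour = rng.choice(others[:k])
        gap = rng.random()
        # Interpolate between `base` and one of its k nearest neighbours.
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return majority, minority + synthetic

def under_then_over(majority, minority, ratio, seed=0):
    rng = random.Random(seed)
    maj, mino = under_sample(majority, minority, ratio, rng)
    return over_sample(maj, mino, rng)

def over_then_under(majority, minority, ratio, seed=0):
    rng = random.Random(seed)
    maj, mino = over_sample(majority, minority, rng)
    return under_sample(maj, mino, ratio, rng)

# Hypothetical toy data: 20 majority and 5 minority points in two dimensions.
data_rng = random.Random(42)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(20)]
minority = [(data_rng.gauss(3, 1), data_rng.gauss(3, 1)) for _ in range(5)]

for name, pipeline in [("under->over", under_then_over), ("over->under", over_then_under)]:
    maj, mino = pipeline(majority, minority, ratio=0.5)
    print(name, "majority:", len(maj), "minority:", len(mino))
```

In this toy setting the class counts differ between the two orders only because the stand-in under-sampler touches the majority class alone; a real instance selection algorithm would decide which instances to keep based on the data it sees, which is why the order of the two steps can matter.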

Original language: English
Journal: Artificial Intelligence Review
State: Accepted/In press - 2022

Keywords

  • Class imbalance
  • Data science
  • Machine learning
  • Over-sampling
  • Under-sampling
