TY - JOUR
T1 - Towards hybrid over- and under-sampling combination methods for class imbalanced datasets
T2 - an experimental study
AU - Lin, Cian
AU - Tsai, Chih-Fong
AU - Lin, Wei-Chao
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer Nature B.V.
PY - 2023/2
Y1 - 2023/2
N2 - The skewed class distributions of many class-imbalanced domain datasets often make it difficult for machine learning techniques to construct effective models. In such cases, data re-sampling techniques, such as under-sampling the majority class and over-sampling the minority class, are usually employed. In the related literature, some studies have shown that hybrid combinations of under- and over-sampling methods applied in different orders can produce better results. However, each of those studies compares only against under- or over-sampling methods alone to reach its final conclusion. Therefore, the research objective of this paper is to find out which order of combining under- and over-sampling methods performs better. Experiments are conducted on 44 different domain datasets using three over-sampling algorithms, namely SMOTE, CTGAN, and TAN, and three under-sampling (i.e., instance selection) algorithms, namely IB3, DROP3, and GA. The results show that if the under-sampling algorithm is chosen carefully, i.e., IB3, no significant performance improvement is obtained by further adding the over-sampling step. Furthermore, with the IB3 algorithm, it is better to perform instance selection first and over-sampling second rather than the reverse order, which allows the random forest classifier to provide the highest AUC rate.
AB - The skewed class distributions of many class-imbalanced domain datasets often make it difficult for machine learning techniques to construct effective models. In such cases, data re-sampling techniques, such as under-sampling the majority class and over-sampling the minority class, are usually employed. In the related literature, some studies have shown that hybrid combinations of under- and over-sampling methods applied in different orders can produce better results. However, each of those studies compares only against under- or over-sampling methods alone to reach its final conclusion. Therefore, the research objective of this paper is to find out which order of combining under- and over-sampling methods performs better. Experiments are conducted on 44 different domain datasets using three over-sampling algorithms, namely SMOTE, CTGAN, and TAN, and three under-sampling (i.e., instance selection) algorithms, namely IB3, DROP3, and GA. The results show that if the under-sampling algorithm is chosen carefully, i.e., IB3, no significant performance improvement is obtained by further adding the over-sampling step. Furthermore, with the IB3 algorithm, it is better to perform instance selection first and over-sampling second rather than the reverse order, which allows the random forest classifier to provide the highest AUC rate.
KW - Class imbalance
KW - Data science
KW - Machine learning
KW - Over-sampling
KW - Under-sampling
UR - http://www.scopus.com/inward/record.url?scp=85128194312&partnerID=8YFLogxK
U2 - 10.1007/s10462-022-10186-5
DO - 10.1007/s10462-022-10186-5
M3 - Journal article
AN - SCOPUS:85128194312
SN - 0269-2821
VL - 56
SP - 845
EP - 863
JO - Artificial Intelligence Review
JF - Artificial Intelligence Review
IS - 2
ER -