Feature selection and its combination with data over-sampling for multi-class imbalanced datasets

Chih Fong Tsai, Kuan Chen Chen, Wei Chao Lin

Research output: Contribution to journal › Article › peer-review

1 Citation (Scopus)

Abstract

Feature selection aims to filter out unrepresentative features from a given dataset in order to construct more effective learning models. Furthermore, ensemble feature selection, which combines multiple feature selection methods, has been shown to outperform single feature selection. However, the performance of different (ensemble) feature selection methods has not been fully examined on multi-class imbalanced datasets. For class-imbalanced datasets, one widely considered solution is to re-balance the data by over-sampling, which generates synthetic examples for the minority classes. However, the effect of performing (ensemble) feature selection together with over-sampling on multi-class imbalanced datasets has not been investigated. Therefore, the first research objective is to examine the performance of single and ensemble feature selection methods, built from fifteen well-known filter, wrapper, and embedded algorithms, in terms of classification accuracy. For the second research objective, the two possible orders of combining the feature selection and over-sampling steps are compared in order to find the best combination procedure as well as the best combined algorithms. The experimental results, based on ten datasets from different domains with low to very high feature dimensions, show that ensemble feature selection methods perform slightly better than single ones, although the performance differences are small. When combined with the Synthetic Minority Oversampling Technique (SMOTE), performing feature selection first and over-sampling second outperforms the reverse order. Although the best combined algorithms are based on ensemble feature selection, eXtreme Gradient Boosting (XGBoost), the best single feature selection algorithm, combined with SMOTE provides classification performance very similar to that of the best combined algorithms. Considering both classification performance and computational cost, the optimal solution is the combination of XGBoost and SMOTE.
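The ordering favoured in the abstract (feature selection first, over-sampling second) can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the between-class/within-class variance ratio below is a simple filter score standing in for the fifteen algorithms studied (including XGBoost), `smote_like` is a simplified SMOTE-style interpolation rather than the reference algorithm, and every function name, the toy dataset, and the parameter values are hypothetical.

```python
import random
from collections import Counter


def f_score(column, labels):
    """Filter score: between-class over within-class sum of squares."""
    overall = sum(column) / len(column)
    between, within = 0.0, 0.0
    for c in set(labels):
        vals = [v for v, lab in zip(column, labels) if lab == c]
        mean_c = sum(vals) / len(vals)
        between += len(vals) * (mean_c - overall) ** 2
        within += sum((v - mean_c) ** 2 for v in vals)
    return between / (within + 1e-12)  # epsilon avoids division by zero


def select_top_k(X, y, k):
    """Return the (sorted) indices of the k highest-scoring features."""
    n_features = len(X[0])
    scores = [f_score([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def smote_like(X, y, k_neighbors=3, seed=0):
    """Simplified SMOTE-style over-sampling (assumption, not the reference
    algorithm): grow every class to the majority size by interpolating
    between a sample and one of its nearest same-class neighbours."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = [list(row) for row in X], list(y)
    for c, n in counts.items():
        members = [row for row, lab in zip(X, y) if lab == c]
        if len(members) < 2:
            continue  # a single sample has no neighbour to interpolate with
        for _ in range(target - n):
            base = rng.choice(members)
            neighbours = sorted(
                (m for m in members if m is not base),
                key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
            )[:k_neighbors]
            other = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> other
            X_out.append([a + gap * (b - a) for a, b in zip(base, other)])
            y_out.append(c)
    return X_out, y_out


def select_then_oversample(X, y, k):
    """Feature selection first, over-sampling second."""
    keep = select_top_k(X, y, k)
    X_reduced = [[row[j] for j in keep] for row in X]
    X_bal, y_bal = smote_like(X_reduced, y)
    return keep, X_bal, y_bal


def make_toy(seed=1):
    """Hypothetical 3-class imbalanced data: features 0 and 2 carry the
    class signal, features 1 and 3 are pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label, n in {0: 30, 1: 10, 2: 5}.items():
        for _ in range(n):
            X.append([
                5 * label + rng.gauss(0, 0.5),
                rng.gauss(0, 1),
                -3 * label + rng.gauss(0, 0.5),
                rng.gauss(0, 1),
            ])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy()
    keep, X_bal, y_bal = select_then_oversample(X, y, k=2)
    print("kept features:", keep)
    print("class counts after over-sampling:", Counter(y_bal))
```

Selecting features before over-sampling means the neighbour distances used for interpolation are computed only in the reduced feature space, so noisy dimensions do not influence which synthetic points are generated. This is one plausible reading of why the order matters, not a claim taken from the paper.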

Original language: English
Article number: 111267
Journal: Applied Soft Computing Journal
Volume: 153
Publication status: Published - Mar 2024
