TY - JOUR
T1 - Ensemble feature selection in high dimension, low sample size datasets
T2 - Parallel and serial combination approaches
AU - Tsai, Chih Fong
AU - Sung, Ya Ting
N1 - Publisher Copyright:
© 2020 Elsevier B.V.
PY - 2020/9/5
Y1 - 2020/9/5
N2 - Feature selection in high dimension, low sample size (HDLSS) data is always an important data pre-processing task. In the literature, the concept of ensemble learning has been applied to improve single feature selection methods, the so-called ensemble feature selection techniques. The most widely used approach is to combine multiple feature selection methods and their selection results via some sort of aggregation function in a parallel manner. Another ensemble strategy is based on the serial combination approach where the selection results of the first feature selection stage are used as input for the second stage of feature selection to produce the final output. The aim of this paper is to fully explore the performance of parallel and serial combination approaches for ensemble feature selection over HDLSS data. In particular, we strive to answer two research questions: whether parallel and serial based ensemble feature selection can outperform single feature selection and which combination approach is the better choice for ensemble feature selection. The experimental results based on comparing nine parallel and nine serial combinations, as well as three single baseline feature selection methods, including principal component analysis (PCA), genetic algorithm (GA), and C4.5 decision tree, show that ensemble feature selection performs better than single feature selection in terms of classification accuracy. However, there are no significant differences in performance between the single best baseline method (i.e. GA) and the top three parallel and serial combinations. On the other hand, the serial combination approach produces the largest feature reduction rate.
AB - Feature selection in high dimension, low sample size (HDLSS) data is always an important data pre-processing task. In the literature, the concept of ensemble learning has been applied to improve single feature selection methods, the so-called ensemble feature selection techniques. The most widely used approach is to combine multiple feature selection methods and their selection results via some sort of aggregation function in a parallel manner. Another ensemble strategy is based on the serial combination approach where the selection results of the first feature selection stage are used as input for the second stage of feature selection to produce the final output. The aim of this paper is to fully explore the performance of parallel and serial combination approaches for ensemble feature selection over HDLSS data. In particular, we strive to answer two research questions: whether parallel and serial based ensemble feature selection can outperform single feature selection and which combination approach is the better choice for ensemble feature selection. The experimental results based on comparing nine parallel and nine serial combinations, as well as three single baseline feature selection methods, including principal component analysis (PCA), genetic algorithm (GA), and C4.5 decision tree, show that ensemble feature selection performs better than single feature selection in terms of classification accuracy. However, there are no significant differences in performance between the single best baseline method (i.e. GA) and the top three parallel and serial combinations. On the other hand, the serial combination approach produces the largest feature reduction rate.
KW - Data mining
KW - Ensemble learning
KW - Feature selection
KW - High dimension low sample size
KW - Machine learning
UR - http://www.scopus.com/inward/record.url?scp=85085728579&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2020.106097
DO - 10.1016/j.knosys.2020.106097
M3 - 期刊論文
AN - SCOPUS:85085728579
SN - 0950-7051
VL - 203
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 106097
ER -