TY - JOUR
T1 - Empirical comparison of supervised learning techniques for missing value imputation
AU - Tsai, Chih Fong
AU - Hu, Ya Han
N1 - Publisher Copyright: © 2022, The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature.
PY - 2022/4
Y1 - 2022/4
N2 - Many data mining algorithms cannot handle incomplete datasets where some data samples are missing attribute values. To solve this problem, missing value imputation is usually conducted and commonly based on reasoning from observed data or complete data to provide estimated replacements for missing values. In general, missing value imputation methods can be classified into statistical and machine learning methods. The statistical methods are usually based on the mean for continuous attributes or mode for discrete attributes, whereas the machine learning methods are based on supervised learning techniques. However, which machine learning method performs optimally for missing value imputation is unknown. This paper compares five well-known supervised learning techniques, namely k-nearest neighbor, the multilayer perceptron neural network (MLP), the classification and regression tree (CART), naïve Bayes, and the support vector machine, to examine their imputation results for categorical, numerical, and mixed data types. The experimental results demonstrate that CART outperforms the other methods for categorical datasets, whereas the MLP is optimal for numerical and mixed datasets in terms of classification accuracy. However, when computational cost is a factor, CART is superior to the MLP because CART can provide reasonably accurate imputation results and requires the least amount of time to perform missing value imputation. Moreover, CART generates the lowest root-mean-squared error of all methods.
KW - Data mining
KW - Data preprocessing
KW - Imputation
KW - Incomplete dataset
KW - Missing value
KW - Supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85126354678&partnerID=8YFLogxK
U2 - 10.1007/s10115-022-01661-0
DO - 10.1007/s10115-022-01661-0
M3 - Journal article
AN - SCOPUS:85126354678
VL - 64
SP - 1047
EP - 1075
JO - Knowledge and Information Systems
JF - Knowledge and Information Systems
SN - 0219-1377
IS - 4
ER -