TY - JOUR
T1 - Deep learning for missing value imputation of continuous data and the effect of data discretization
AU - Lin, Wei Chao
AU - Tsai, Chih Fong
AU - Zhong, Jia Rong
N1 - Publisher Copyright:
© 2021 Elsevier B.V.
PY - 2022/3/5
Y1 - 2022/3/5
N2 - Often real-world datasets are incomplete and contain some missing attribute values. Furthermore, many data mining and machine learning techniques cannot directly handle incomplete datasets. Missing value imputation is the major solution for constructing a learning model to estimate specific values to replace the missing ones. Deep learning techniques have been employed for missing value imputation and demonstrated their superiority over many other well-known imputation methods. However, very few studies have attempted to assess the imputation performance of deep learning techniques for tabular or structured data with continuous values. Moreover, the effect on the imputation results when the continuous data need to be discretized has never been examined. In this paper, two supervised deep neural networks, i.e., multilayer perceptron (MLP) and deep belief networks (DBN), are compared for missing value imputation. Moreover, two differently ordered combinations of data discretization and imputation steps are examined. The results show that MLP and DBN significantly outperform the baseline imputation methods based on the mean, KNN, CART, and SVM, with DBN performing the best. On the other hand, when considering the discretization of continuous data, the order in which the two steps are combined is not the most important, but rather, the chosen imputation algorithm. That is, the final performance is much better when using DBN for imputation, regardless of whether discretization is performed in the first or second step, than the other imputation methods.
AB - Often real-world datasets are incomplete and contain some missing attribute values. Furthermore, many data mining and machine learning techniques cannot directly handle incomplete datasets. Missing value imputation is the major solution for constructing a learning model to estimate specific values to replace the missing ones. Deep learning techniques have been employed for missing value imputation and demonstrated their superiority over many other well-known imputation methods. However, very few studies have attempted to assess the imputation performance of deep learning techniques for tabular or structured data with continuous values. Moreover, the effect on the imputation results when the continuous data need to be discretized has never been examined. In this paper, two supervised deep neural networks, i.e., multilayer perceptron (MLP) and deep belief networks (DBN), are compared for missing value imputation. Moreover, two differently ordered combinations of data discretization and imputation steps are examined. The results show that MLP and DBN significantly outperform the baseline imputation methods based on the mean, KNN, CART, and SVM, with DBN performing the best. On the other hand, when considering the discretization of continuous data, the order in which the two steps are combined is not the most important, but rather, the chosen imputation algorithm. That is, the final performance is much better when using DBN for imputation, regardless of whether discretization is performed in the first or second step, than the other imputation methods.
KW - Data discretization
KW - Data science
KW - Deep learning
KW - Machine learning
KW - Missing value imputation
UR - http://www.scopus.com/inward/record.url?scp=85122686867&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2021.108079
DO - 10.1016/j.knosys.2021.108079
M3 - 期刊論文
AN - SCOPUS:85122686867
SN - 0950-7051
VL - 239
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 108079
ER -