Empirical comparison of supervised learning techniques for missing value imputation

Research output: Contribution to journal › Article › peer-review

17 Scopus citations

Abstract

Many data mining algorithms cannot handle incomplete datasets in which some data samples are missing attribute values. To solve this problem, missing value imputation is usually performed: estimated replacements for the missing values are inferred from the observed or complete data. In general, missing value imputation methods can be classified into statistical and machine learning methods. The statistical methods are usually based on the mean for continuous attributes or the mode for discrete attributes, whereas the machine learning methods are based on supervised learning techniques. However, it is unknown which machine learning method performs best for missing value imputation. This paper compares five well-known supervised learning techniques, namely k-nearest neighbor, the multilayer perceptron neural network (MLP), the classification and regression tree (CART), naïve Bayes, and the support vector machine, to examine their imputation results for categorical, numerical, and mixed data types. The experimental results demonstrate that, in terms of classification accuracy, CART outperforms the other methods for categorical datasets, whereas the MLP performs best for numerical and mixed datasets. However, when computational cost is a factor, CART is superior to the MLP because CART provides reasonably accurate imputation results and requires the least time to perform missing value imputation. Moreover, CART generates the lowest root-mean-squared error of all methods.
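To make the idea of supervised imputation concrete, the following is a minimal sketch of k-nearest neighbor imputation for one numerical attribute, together with a root-mean-squared error helper. It is an illustration only, not the authors' implementation: the function names `knn_impute` and `rmse`, the toy data, and the choice of Euclidean distance are all assumptions, and the sketch assumes that only the target column contains missing values (marked as `None`) while the other attributes are numeric and complete.

```python
import math
from statistics import mean


def knn_impute(rows, target, k=3):
    """Fill None entries in column `target` with the mean of that column
    over the k nearest complete rows (hypothetical helper, sketch only)."""
    # Rows whose target value is observed act as the training set.
    complete = [r for r in rows if r[target] is not None]
    # All other columns are used as predictor attributes.
    features = [i for i in range(len(rows[0])) if i != target]

    imputed = []
    for r in rows:
        filled = list(r)
        if r[target] is None:
            query = [r[i] for i in features]
            # Rank complete rows by Euclidean distance on the predictors.
            neighbours = sorted(
                complete,
                key=lambda c: math.dist(query, [c[i] for i in features]),
            )[:k]
            filled[target] = mean(n[target] for n in neighbours)
        imputed.append(filled)
    return imputed


def rmse(true_values, predicted_values):
    """Root-mean-squared error between two equal-length sequences."""
    return math.sqrt(
        mean((t - p) ** 2 for t, p in zip(true_values, predicted_values))
    )


# Toy data (made up for illustration): the last row is missing column 2.
rows = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [0.9, 1.9, 11.0],
    [5.0, 6.0, 50.0],
    [5.1, 6.1, 52.0],
    [1.0, 2.0, None],
]
filled = knn_impute(rows, target=2, k=3)
print(filled[5][2])  # mean of the three nearest targets: 11.0
```

For a categorical target, the mean of the neighbours would be replaced by their most frequent value. The same pattern (train on the complete rows, predict the missing attribute) applies to the other techniques named in the abstract, with the neighbour search swapped for a different learner.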

Original language: English
Pages (from-to): 1047-1075
Number of pages: 29
Journal: Knowledge and Information Systems
Volume: 64
Issue number: 4
State: Published - Apr 2022

Keywords

  • Data mining
  • Data preprocessing
  • Imputation
  • Incomplete dataset
  • Missing value
  • Supervised learning
