Data discretization (or simply discretization) is the process of transforming continuous data values into discrete ones. Discretization can make the results of data analysis easier to interpret. In addition, many well-known data mining algorithms, such as C4.5/5.0 decision trees and naïve Bayes, are better suited to handling discrete data. In practice, real-world datasets usually contain some noisy data. For example, they may contain redundant or irrelevant features and outliers, which can negatively affect the final mining results. Moreover, collected datasets often have missing (attribute) values. In the literature, data cleaning techniques, including feature selection, instance selection, and missing value imputation, have been widely used to solve these problems. However, a collected dataset may require discretization for a specific mining purpose while also containing noisy features, outliers, and/or missing values. In that case, both discretization and one of the three data cleaning techniques should be considered in the data pre-processing step. Yet very few studies in the related literature have investigated the interaction effects between discretization and the data cleaning techniques. Therefore, the aim of this three-year research project is to find the optimal combination of discretization with each of the three types of data cleaning techniques. In other words, the research question is: does performing discretization first and the data cleaning step (i.e. feature selection, instance selection, or missing value imputation) second, or performing the data cleaning step first and discretization second, produce the best mining result?
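The ordering question can be illustrated with a toy sketch. It assumes equal-width binning as the discretizer and simple mean/mode imputation as the data cleaning step; both are stand-ins chosen for brevity, not the techniques the project itself evaluates, and all function names are hypothetical.

```python
from statistics import mean, mode


def equal_width_bins(values, n_bins):
    """Map each non-missing value to a bin index in [0, n_bins - 1]; None stays None."""
    observed = [v for v in values if v is not None]
    lo, hi = min(observed), max(observed)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant column
    return [
        None if v is None else min(int((v - lo) / width), n_bins - 1)
        for v in values
    ]


def impute(values, fill):
    """Replace every missing entry (None) with the given fill value."""
    return [fill if v is None else v for v in values]


def clean_then_discretize(values, n_bins):
    """Assumed order A: mean-impute the continuous values, then bin them."""
    observed = [v for v in values if v is not None]
    return equal_width_bins(impute(values, mean(observed)), n_bins)


def discretize_then_clean(values, n_bins):
    """Assumed order B: bin the observed values, then impute the most frequent bin."""
    binned = equal_width_bins(values, n_bins)
    observed = [b for b in binned if b is not None]
    return impute(binned, mode(observed))


# Hypothetical attribute column with one missing value and one outlier.
column = [1.0, 2.0, None, 3.0, 10.0]
print(clean_then_discretize(column, 3))
print(discretize_then_clean(column, 3))
```

On this column the two orderings assign the missing entry to different bins: the outlier pulls the mean into the middle bin, while the most frequent bin is the lowest one. That disagreement is the kind of interaction effect the project sets out to study.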
Effective start/end date: 1/08/21 → 31/07/22
Keywords
- data discretization
- data cleaning
- feature selection
- instance selection
- missing value imputation
- data mining
- machine learning