Abstract
Data reduction is an important pre-processing step in the knowledge discovery in databases (KDD) process. One approach is to apply instance selection algorithms, which filter unrepresentative or noisy data out of a given training dataset. However, the performance of instance selection on very high dimensional data has not yet been fully examined. In this paper, we introduce a novel efficient genetic algorithm (EGA), which fits "biological evolution" into the evolutionary process; in other words, after long-term evolution, individuals find the most efficient way to allocate resources and evolve. The experimental study is based on four very high dimensional datasets, ranging from 200 to 18,236 dimensions, and compares EGA with four state-of-the-art algorithms: IB3, DROP3, ICF, and GA. The experimental results show that, with EGA, the k-NN and SVM classifiers achieve classification performance closest to that of the baseline classifiers trained without instance selection. In particular, EGA outperforms the four algorithms in terms of average classification accuracy. Moreover, EGA produces the largest reduction rates (the same as GA) and requires relatively less computational time than the other four algorithms.
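For readers unfamiliar with instance selection by genetic algorithm, the following is a minimal sketch of the generic idea only. It is not the EGA proposed in the paper, whose details the abstract does not give. Every name here (`nn_predict`, `fitness`, `select_instances`), the fitness weighting `alpha`, the GA parameters, and the toy two-cluster dataset are assumptions made for illustration. Each chromosome is a keep/drop mask over the training instances, and fitness trades 1-NN accuracy against reduction rate.

```python
import random

def nn_predict(train, query):
    """Return the label of the training point closest to query (1-NN)."""
    nearest = min(train, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], query)))
    return nearest[1]

def fitness(mask, data, alpha=0.5):
    """Weighted sum of 1-NN accuracy and reduction rate (alpha is an assumed weight).

    Simplification: accuracy is measured on the full set, so kept
    instances trivially classify themselves correctly.
    """
    subset = [p for p, keep in zip(data, mask) if keep]
    if not subset:
        return 0.0
    accuracy = sum(nn_predict(subset, x) == y for x, y in data) / len(data)
    reduction = 1 - len(subset) / len(data)
    return alpha * accuracy + (1 - alpha) * reduction

def select_instances(data, pop_size=20, generations=30, mutation_rate=0.05, seed=0):
    """Evolve keep/drop masks and return the instances kept by the best mask."""
    rng = random.Random(seed)
    n = len(data)
    population = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda m: fitness(m, data), reverse=True)
        parents = ranked[: pop_size // 2]      # truncation selection
        children = [ranked[0]]                 # elitism: carry over the best mask
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)          # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with a small per-gene probability.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    best = max(population, key=lambda m: fitness(m, data))
    return [p for p, keep in zip(data, best) if keep]

# Toy data (assumed): two Gaussian clusters in 2-D, labelled 0 and 1.
toy_rng = random.Random(1)
data = [((toy_rng.gauss(0, 1), toy_rng.gauss(0, 1)), 0) for _ in range(20)]
data += [((toy_rng.gauss(5, 1), toy_rng.gauss(5, 1)), 1) for _ in range(20)]
kept = select_instances(data)
print(len(kept), "of", len(data), "instances kept")
```

On data with thousands of dimensions, as studied in the paper, the distance computations inside the fitness function dominate the running time, which is why the computational cost of the search matters alongside accuracy and reduction rate.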
Original language | English |
---|---|
Pages (from-to) | 79-92 |
Number of pages | 14 |
Journal | Decision Support Systems |
Volume | 61 |
Issue number | 1 |
State | Published - May 2014 |
Keywords
- Data mining
- Data reduction
- Genetic algorithms
- High dimensional data
- Instance selection
- Machine learning