Learning to detect representative data for large scale instance selection

Wei Chao Lin, Chih Fong Tsai, Shih Wen Ke, Chia Wen Hung, William Eberle

Research output: Contribution to journal › Article › peer-review

23 Scopus citations


Abstract

Instance selection is an important data pre-processing step in the knowledge discovery process. However, the datasets of many domain problems are very large, and some are non-stationary, composed of both old data and a large amount of new data samples. Current algorithms for this type of scalability problem are limited in that they incur a very high computational cost when performing instance selection over very large scale datasets. To this end, we introduce the ReDD (Representative Data Detection) approach, which is based on outlier pattern analysis and prediction. First, a machine learning model, or detector, learns the patterns of (un)representative data selected by a specific instance selection method from a small amount of training data. Then, the detector is used to detect representative data in the remaining large amount of training data, or in newly added data. We empirically evaluate ReDD over 50 domain datasets to examine the effectiveness of the learned detector, using four very large scale datasets for validation. The experimental results show that ReDD not only reduces the computational cost by a factor of nearly two to three compared with three baselines, but also maintains the final classification accuracy.
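The two-step procedure described in the abstract can be illustrated with a minimal sketch. The abstract names neither the instance selection method nor the detector model, so both are assumptions here: Wilson's Edited Nearest Neighbour (ENN) rule stands in for the selection method, and a same-class 1-NN rule stands in for the detector. All function names (`enn_keep_flags`, `train_detector`, `detect`, `redd_select`) are hypothetical and not taken from the paper.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length feature tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def enn_keep_flags(X, y, k=3):
    """Stand-in instance selection method (assumption: Wilson's ENN).

    A sample is flagged True (representative) when the majority label of its
    k nearest neighbours, excluding itself, matches its own label.
    The cost is quadratic in len(X), so it is run on the small subset only.
    """
    flags = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: sq_dist(xi, X[j]))
        majority = Counter(y[j] for j in others[:k]).most_common(1)[0][0]
        flags.append(majority == yi)
    return flags


def train_detector(X_small, y_small, k=3):
    """Step 1: label the small subset with the selection method and store it."""
    flags = enn_keep_flags(X_small, y_small, k)
    return list(zip(X_small, y_small, flags))


def detect(detector, x, label):
    """Step 2: predict whether the sample (x, label) is representative.

    Hypothetical detector: copy the flag of the nearest stored sample that has
    the same class label (a 1-NN rule). Unseen classes default to True.
    """
    same_class = [(sq_dist(x, sx), flag)
                  for sx, sy, flag in detector if sy == label]
    return min(same_class)[1] if same_class else True


def redd_select(X_small, y_small, X_rest, y_rest, k=3):
    """Return the indices of X_rest that the learned detector keeps."""
    detector = train_detector(X_small, y_small, k)
    return [i for i, (x, c) in enumerate(zip(X_rest, y_rest))
            if detect(detector, x, c)]
```

In this sketch the quadratic-cost selection step touches only `X_small`, while each remaining or newly added sample costs a single pass over the stored subset. That is the kind of saving the abstract attributes to ReDD, although the paper's actual selection methods, detector, and features may differ.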

Original language: English
Article number: 9498
Pages (from-to): 1-8
Number of pages: 8
Journal: Journal of Systems and Software
State: Published - 1 Aug 2015


  • Data mining
  • Data reduction
  • Instance selection

