TY - JOUR
T1 - On Combining Instance Selection and Discretisation
T2 - A Comparative Study of Two Combination Orders
AU - Sue, Kuen Liang
AU - Tsai, Chih Fong
AU - Yan, Tzu Ming
N1 - Publisher Copyright: © World Scientific Publishing Co.
PY - 2024/10/1
Y1 - 2024/10/1
AB - Data discretisation converts continuous attribute values into discrete ones, which are closer to a knowledge-level representation and are easier to understand, use, and explain than continuous values. Instance selection, on the other hand, aims to filter out noisy or unrepresentative data samples from a given training dataset before a learning model is constructed. In practice, some domain datasets may require both discretisation and instance selection. In such cases, the order in which the two steps are combined results in different processed datasets. For example, discretisation can be performed first on the original dataset, after which the instance selection algorithm evaluates the discretised data for selection; alternatively, instance selection can be performed first on the continuous data, after which the discretiser converts the attribute values of the reduced dataset. However, this issue has not been investigated before. The aim of this paper is to compare the performance of a classifier trained and tested over datasets processed by these two combination orders. Specifically, the minimum description length principle (MDLP) and ChiMerge are used for discretisation, and IB3, DROP3 and GA for instance selection. The experimental results obtained using ten different domain datasets show that executing instance selection first and discretisation second performs best, which can serve as a guideline for datasets that require both steps. In particular, combining DROP3 and MDLP can provide a classification accuracy of 0.85 and an AUC of 0.8, which can be regarded as a representative baseline for future related research.
KW - Data discretisation
KW - data mining
KW - instance selection
KW - machine learning
KW - outliers
UR - http://www.scopus.com/inward/record.url?scp=85202837114&partnerID=8YFLogxK
U2 - 10.1142/S0219649224500813
DO - 10.1142/S0219649224500813
M3 - Journal article
AN - SCOPUS:85202837114
SN - 0219-6492
VL - 23
JO - Journal of Information and Knowledge Management
JF - Journal of Information and Knowledge Management
IS - 5
M1 - 2450081
ER -