TY - GEN
T1 - A guideline to determine the training sample size when applying big data mining methods in clinical decision making
AU - Daniyal,
AU - Wang, Wei Jen
AU - Su, Mu Chun
AU - Lee, Si Huei
AU - Hung, Ching Sui
AU - Chen, Chun Chuan
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/6/22
Y1 - 2018/6/22
N2 - Biomedicine is a field rich in heterogeneous, evolving, complex and unstructured massive data coming from autonomous sources (i.e., the HACE theorem). Big data mining has become one of the most fascinating and fastest-growing areas, enabling the selection, exploration and modeling of vast amounts of medical data to support clinical decision making, prevent medication errors, and enhance patient outcomes. Given the complexity and unstructured nature of biomedical data, it is acknowledged that no single data mining method is best for all applications. Indeed, an appropriate process and algorithm for big data mining are essential for obtaining a truthful result. To date, however, no guideline exists for this, especially regarding a fair sample size in the training set for reliable results. Sample size is of central importance because biomedical data do not come cheap: acquiring them takes time and human effort and is usually very expensive. On the other hand, a small sample size may lead to overestimates of predictive accuracy through overfitting to the data. The purpose of this paper is to provide a guideline for determining a sample size that can yield robust accuracy. Because an increment in data volume adds complexity and has a significant impact on accuracy, we examined the relationship among sample size, data variation and the performance of different data mining methods, including SVM, Naïve Bayes, Logistic Regression and J48, using simulation and two sets of biomedical data. The simulation results revealed that sample size can dramatically affect the performance of data mining methods under a given data variation, and this effect is most manifest in the nonlinear case. For experimental biomedical data, it is essential to examine the impact of sample size and data variation on performance in order to determine the sample size.
AB - Biomedicine is a field rich in heterogeneous, evolving, complex and unstructured massive data coming from autonomous sources (i.e., the HACE theorem). Big data mining has become one of the most fascinating and fastest-growing areas, enabling the selection, exploration and modeling of vast amounts of medical data to support clinical decision making, prevent medication errors, and enhance patient outcomes. Given the complexity and unstructured nature of biomedical data, it is acknowledged that no single data mining method is best for all applications. Indeed, an appropriate process and algorithm for big data mining are essential for obtaining a truthful result. To date, however, no guideline exists for this, especially regarding a fair sample size in the training set for reliable results. Sample size is of central importance because biomedical data do not come cheap: acquiring them takes time and human effort and is usually very expensive. On the other hand, a small sample size may lead to overestimates of predictive accuracy through overfitting to the data. The purpose of this paper is to provide a guideline for determining a sample size that can yield robust accuracy. Because an increment in data volume adds complexity and has a significant impact on accuracy, we examined the relationship among sample size, data variation and the performance of different data mining methods, including SVM, Naïve Bayes, Logistic Regression and J48, using simulation and two sets of biomedical data. The simulation results revealed that sample size can dramatically affect the performance of data mining methods under a given data variation, and this effect is most manifest in the nonlinear case. For experimental biomedical data, it is essential to examine the impact of sample size and data variation on performance in order to determine the sample size.
KW - Artificial intelligence
KW - Big data mining
KW - Guideline
KW - Heterogeneous data
KW - Sample size
UR - http://www.scopus.com/inward/record.url?scp=85050292515&partnerID=8YFLogxK
U2 - 10.1109/ICASI.2018.8394347
DO - 10.1109/ICASI.2018.8394347
M3 - Conference contribution
AN - SCOPUS:85050292515
T3 - Proceedings of 4th IEEE International Conference on Applied System Innovation 2018, ICASI 2018
SP - 678
EP - 681
BT - Proceedings of 4th IEEE International Conference on Applied System Innovation 2018, ICASI 2018
A2 - Lam, Artde Donald Kin-Tak
A2 - Prior, Stephen D.
A2 - Meen, Teen-Hang
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 4th IEEE International Conference on Applied System Innovation, ICASI 2018
Y2 - 13 April 2018 through 17 April 2018
ER -