A guideline to determine the training sample size when applying big data mining methods in clinical decision making

Daniyal, Wei Jen Wang, Mu Chun Su, Si Huei Lee, Ching Sui Hung, Chun Chuan Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-reviewed

1 Citation (Scopus)

Abstract

Biomedicine is a field rich in heterogeneous, evolving, complex and unstructured massive data coming from autonomous sources (as described by the HACE theorem). Big data mining has become one of the most fascinating and fastest-growing areas; it enables the selection, exploration and modeling of vast amounts of medical data to support clinical decision making, prevent medication errors, and improve patient outcomes. Given the complexity and unstructured nature of biomedical data, it is acknowledged that no single data mining method is best for all applications. An appropriate process and algorithm for big data mining is therefore essential for obtaining a trustworthy result. To date, however, there is no guideline for this choice, especially regarding an adequate training-set sample size for reliable results. Sample size is of central importance because biomedical data do not come cheap: they take time and manpower to acquire and are usually very expensive. On the other hand, a small sample size may lead to overestimation of predictive accuracy through overfitting to the data. The purpose of this paper is to provide a guideline for determining a sample size that yields robust accuracy. Because increasing data volume adds complexity and has a significant impact on accuracy, we examined the relationship among sample size, data variation and the performance of different data mining methods, including SVM, Naïve Bayes, Logistic Regression and J48, using simulation and two sets of biomedical data. The simulation results revealed that sample size can dramatically affect the performance of data mining methods under a given data variation, and that this effect is most pronounced in the nonlinear case. For experimental biomedical data, it is essential to examine the impact of sample size and data variation on performance in order to determine the sample size.
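The relationship the abstract describes, between training sample size, data variation and classifier performance, can be illustrated with a small learning-curve simulation. The sketch below is not the authors' simulation: the synthetic two-class Gaussian data, the single informative feature, the sample sizes, the `sigma` values and all function names are assumptions chosen for illustration. A pure-Python Gaussian Naïve Bayes stands in for the four methods named in the abstract.

```python
import math
import random
import statistics


def make_data(n, sigma, rng, n_features=5):
    """Draw n labelled rows. Only feature 0 separates the classes
    (class means 0 and 1); the rest is noise. sigma is the assumed
    'data variation'."""
    data = []
    for i in range(n):
        label = i % 2  # alternate labels so the classes stay balanced
        row = [rng.gauss(float(label) if j == 0 else 0.0, sigma)
               for j in range(n_features)]
        data.append((row, label))
    return data


def fit_gaussian_nb(data):
    """Estimate per-class feature means, variances and class priors."""
    model = {}
    for label in (0, 1):
        rows = [x for x, y in data if y == label]
        cols = list(zip(*rows))
        means = [statistics.fmean(c) for c in cols]
        # A small floor keeps variances positive for tiny samples.
        variances = [max(statistics.pvariance(c), 1e-6) for c in cols]
        model[label] = (means, variances, len(rows) / len(data))
    return model


def predict(model, x):
    """Return the class with the larger Gaussian log-posterior."""
    def log_posterior(label):
        means, variances, prior = model[label]
        total = math.log(prior)
        for xi, m, v in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        return total
    return max((0, 1), key=log_posterior)


def accuracy(model, data):
    return sum(predict(model, x) == y for x, y in data) / len(data)


def learning_curve(sizes, sigma, repeats=20, seed=0):
    """For each training size, average the apparent (training) accuracy
    and the held-out accuracy over several random training sets."""
    rng = random.Random(seed)
    test = make_data(2000, sigma, rng)  # large held-out set
    curve = []
    for n in sizes:
        train_acc, test_acc = [], []
        for _ in range(repeats):
            train = make_data(n, sigma, rng)
            model = fit_gaussian_nb(train)
            train_acc.append(accuracy(model, train))
            test_acc.append(accuracy(model, test))
        curve.append((n, statistics.fmean(train_acc), statistics.fmean(test_acc)))
    return curve


if __name__ == "__main__":
    for sigma in (0.5, 1.0):
        for n, tr, te in learning_curve([10, 40, 160, 640], sigma):
            print(f"sigma={sigma} n={n:4d} train={tr:.3f} test={te:.3f}")
```

In a sketch like this, a training accuracy well above the held-out accuracy at small sizes corresponds to the overestimation through overfitting that the abstract warns about. The size at which the two curves converge and the held-out accuracy levels off is one plausible candidate for an adequate sample size under the chosen variation.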

Original language: English
Title of host publication: Proceedings of 4th IEEE International Conference on Applied System Innovation 2018, ICASI 2018
Editors: Artde Donald Kin-Tak Lam, Stephen D. Prior, Teen-Hang Meen
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 678-681
Number of pages: 4
ISBN (electronic): 9781538643426
Publication status: Published - 22 Jun 2018
Event: 4th IEEE International Conference on Applied System Innovation, ICASI 2018 - Chiba, Japan
Duration: 13 Apr 2018 to 17 Apr 2018

Publication series

Name: Proceedings of 4th IEEE International Conference on Applied System Innovation 2018, ICASI 2018

Conference

Conference: 4th IEEE International Conference on Applied System Innovation, ICASI 2018
Country/Territory: Japan
City: Chiba
Period: 13/04/18 to 17/04/18
