A guideline to determine the training sample size when applying big data mining methods in clinical decision making

Daniyal, Wei Jen Wang, Mu Chun Su, Si Huei Lee, Ching Sui Hung, Chun Chuan Chen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citations

Abstract

Biomedicine is a field rich in heterogeneous, evolving, complex, and unstructured massive data originating from autonomous sources (i.e., the HACE theorem). Big data mining has become one of the most fascinating and fastest-growing areas, enabling the selection, exploration, and modeling of vast amounts of medical data to support clinical decision making, prevent medication errors, and improve patient outcomes. Given the complexity and unstructured nature of biomedical data, it is acknowledged that no single data mining method is best for all applications. Indeed, an appropriate process and algorithm for big data mining are essential for obtaining trustworthy results. To date, however, no guideline exists for this, especially regarding a fair sample size in the training set for reliable results. Sample size is of central importance because biomedical data do not come cheap: acquiring them takes time and human effort and is usually very expensive. On the other hand, a small sample size may lead to overestimates of predictive accuracy through overfitting to the data. The purpose of this paper is to provide a guideline for determining a sample size that yields robust accuracy. Because increasing data volume adds complexity and significantly affects accuracy, we examined the relationship among sample size, data variation, and the performance of different data mining methods, including SVM, Naïve Bayes, Logistic Regression, and J48, using simulations and two sets of biomedical data. The simulation results revealed that, for a given data variation, sample size can dramatically affect the performance of data mining methods, and that this effect is most pronounced in the nonlinear case. For experimental biomedical data, it is therefore essential to examine the impact of sample size and data variation on performance in order to determine an adequate sample size.
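The kind of analysis the abstract describes — measuring how test accuracy varies with training-set size at a fixed data variation — can be sketched as follows. This is a minimal illustration, not the authors' protocol: it simulates two Gaussian classes with noise level `sigma` (a stand-in for "data variation"), trains a hand-rolled Gaussian Naïve Bayes classifier (one of the four methods named) at several training sizes, and averages test accuracy over repetitions.

```python
# Sketch (assumed setup, not the paper's exact simulation): accuracy of a
# Gaussian Naive Bayes classifier on simulated 2-D two-class data, as a
# function of training sample size at a fixed noise level (data variation).
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, sigma):
    """Balanced two-class data: class means at (0,0) and (1,1), noise sigma."""
    y = np.arange(n) % 2                      # guarantee both classes appear
    X = rng.normal(0.0, sigma, size=(n, 2)) + y[:, None]
    return X, y

def fit_nb(X, y):
    """Estimate per-class feature means and variances (Gaussian Naive Bayes)."""
    return {c: (X[y == c].mean(axis=0), X[y == c].var(axis=0) + 1e-9)
            for c in (0, 1)}

def predict_nb(params, X):
    """Pick the class with the higher Gaussian log-likelihood."""
    logp = []
    for c in (0, 1):
        mu, var = params[c]
        logp.append(-0.5 * (((X - mu) ** 2) / var
                            + np.log(2 * np.pi * var)).sum(axis=1))
    return (logp[1] > logp[0]).astype(int)

X_test, y_test = simulate(5000, sigma=0.5)    # large held-out test set
results = {}
for n_train in (10, 100, 1000):
    accs = []
    for _ in range(20):                       # repeat to smooth sampling noise
        X_tr, y_tr = simulate(n_train, sigma=0.5)
        accs.append(float((predict_nb(fit_nb(X_tr, y_tr), X_test)
                           == y_test).mean()))
    results[n_train] = float(np.mean(accs))
    print(f"n_train={n_train:5d}  mean test accuracy={results[n_train]:.3f}")
```

Running the same sweep at several `sigma` values (and with a nonlinear class boundary) reproduces the kind of sample-size-versus-accuracy curves the abstract refers to; the plateau of such a curve suggests a sample size beyond which extra data buys little accuracy.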

Original language: English
Title of host publication: Proceedings of 4th IEEE International Conference on Applied System Innovation 2018, ICASI 2018
Editors: Artde Donald Kin-Tak Lam, Stephen D. Prior, Teen-Hang Meen
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 678-681
Number of pages: 4
ISBN (Electronic): 9781538643426
DOIs
State: Published - 22 Jun 2018
Event: 4th IEEE International Conference on Applied System Innovation, ICASI 2018 - Chiba, Japan
Duration: 13 Apr 2018 → 17 Apr 2018

Publication series

Name: Proceedings of 4th IEEE International Conference on Applied System Innovation 2018, ICASI 2018

Conference

Conference: 4th IEEE International Conference on Applied System Innovation, ICASI 2018
Country/Territory: Japan
City: Chiba
Period: 13/04/18 → 17/04/18

Keywords

  • Artificial intelligence
  • Big data mining
  • Guideline
  • Heterogeneous data
  • Sample size
