TY - JOUR
T1 - Machine learning model for diagnostic method prediction in parasitic disease using clinical information
AU - Lee, You Won
AU - Choi, Jae Woo
AU - Shin, Eun Hee
N1 - Publisher Copyright: © 2021 The Author(s)
PY - 2021/12/15
Y1 - 2021/12/15
N2 - Diagnosing a parasitic disease is difficult in clinical practice. In this study, we constructed machine learning models for diagnosis prediction using patient information. First, we predicted whether a patient has a parasitic disease. Next, for patients with a parasitic disease, we predicted the appropriate diagnostic method among six categories (biopsy, endoscopy, microscopy, molecular, radiology, and serology). To build the datasets, we extracted patient information from PubMed abstracts published from 1956 to 2019. This yielded two datasets: one for predicting parasite infection (N = 8748) and one for predicting the diagnostic method (N = 3780). We compared four machine learning models: support vector machine, random forest, multi-layered perceptron, and gradient boosting. To address the data imbalance problem, we used the synthetic minority over-sampling technique (SMOTE) and TomekLinks. In the parasite-infected patient dataset, random forest, random forest with SMOTE, gradient boosting, gradient boosting with SMOTE, and gradient boosting with TomekLinks performed best (AUC: 79%). In the diagnosis method dataset, gradient boosting with SMOTE was the best model (AUC: 87%). In per-class prediction, gradient boosting performed best for biopsy (AUC: 88%); gradient boosting with SMOTE performed best for endoscopy (AUC: 94%), molecular (AUC: 90%), and radiology (AUC: 88%); and random forest performed best for microscopy (AUC: 82%) and serology (AUC: 85%). We calculated feature importance using gradient boosting; age was the most important feature. In conclusion, this study demonstrated that gradient boosting with SMOTE can predict a parasitic disease and serve as a promising diagnostic tool for binary and multi-class classification schemes.
KW - Binary-classification
KW - Diagnosis
KW - Machine learning
KW - Multi-classification
KW - Parasite
UR - http://www.scopus.com/inward/record.url?scp=85111544654&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2021.115658
DO - 10.1016/j.eswa.2021.115658
M3 - Article
AN - SCOPUS:85111544654
SN - 0957-4174
VL - 185
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 115658
ER -