TY - JOUR
T1 - Natural Language Processing for Information Extraction of Gastric Diseases and Its Application in Large-Scale Clinical Research
AU - Song, Gyuseon
AU - Chung, Su Jin
AU - Seo, Ji Yeon
AU - Yang, Sun Young
AU - Jin, Eun Hyo
AU - Chung, Goh Eun
AU - Shim, Sung Ryul
AU - Sa, Soonok
AU - Hong, Moongi Simon
AU - Kim, Kang Hyun
AU - Jang, Eunchan
AU - Lee, Chae Won
AU - Bae, Jung Ho
AU - Han, Hyun Wook
N1 - Publisher Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland.
PY - 2022/6/1
Y1 - 2022/6/1
N2 - Background and Aims: The utility of clinical information from esophagogastroduodenoscopy (EGD) reports has been limited because of its unstructured narrative format. We developed a natural language processing (NLP) pipeline that automatically extracts information about gastric diseases from unstructured EGD reports and demonstrated its applicability in clinical research. Methods: An NLP pipeline was developed using 2000 EGD and associated pathology reports that were retrieved from a single healthcare center. The pipeline extracted clinical information, including the presence, location, and size, for 10 gastric diseases from the EGD reports. It was validated with 1000 EGD reports by evaluating sensitivity, positive predictive value (PPV), accuracy, and F1 score. The pipeline was applied to 248,966 EGD reports from 2010–2019 to identify patient demographics and clinical information for 10 gastric diseases. Results: For gastritis information extraction, we achieved an overall sensitivity, PPV, accuracy, and F1 score of 0.966, 0.972, 0.996, and 0.967, respectively. Other gastric diseases, such as ulcers, and neoplastic diseases achieved an overall sensitivity, PPV, accuracy, and F1 score of 0.975, 0.982, 0.999, and 0.978, respectively. The study of EGD data of over 10 years revealed the demographics of patients with gastric diseases by sex and age. In addition, the study identified the extent and locations of gastritis and other gastric diseases, respectively. Conclusions: We demonstrated the feasibility of the NLP pipeline providing an automated extraction of gastric disease information from EGD reports. Incorporating the pipeline can facilitate large-scale clinical research to better understand gastric diseases.
AB - Background and Aims: The utility of clinical information from esophagogastroduodenoscopy (EGD) reports has been limited because of its unstructured narrative format. We developed a natural language processing (NLP) pipeline that automatically extracts information about gastric diseases from unstructured EGD reports and demonstrated its applicability in clinical research. Methods: An NLP pipeline was developed using 2000 EGD and associated pathology reports that were retrieved from a single healthcare center. The pipeline extracted clinical information, including the presence, location, and size, for 10 gastric diseases from the EGD reports. It was validated with 1000 EGD reports by evaluating sensitivity, positive predictive value (PPV), accuracy, and F1 score. The pipeline was applied to 248,966 EGD reports from 2010–2019 to identify patient demographics and clinical information for 10 gastric diseases. Results: For gastritis information extraction, we achieved an overall sensitivity, PPV, accuracy, and F1 score of 0.966, 0.972, 0.996, and 0.967, respectively. Other gastric diseases, such as ulcers, and neoplastic diseases achieved an overall sensitivity, PPV, accuracy, and F1 score of 0.975, 0.982, 0.999, and 0.978, respectively. The study of EGD data of over 10 years revealed the demographics of patients with gastric diseases by sex and age. In addition, the study identified the extent and locations of gastritis and other gastric diseases, respectively. Conclusions: We demonstrated the feasibility of the NLP pipeline providing an automated extraction of gastric disease information from EGD reports. Incorporating the pipeline can facilitate large-scale clinical research to better understand gastric diseases.
KW - digestive system
KW - endoscopy
KW - gastritis
KW - information extraction
KW - natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85131016050&partnerID=8YFLogxK
U2 - 10.3390/jcm11112967
DO - 10.3390/jcm11112967
M3 - Article
AN - SCOPUS:85131016050
VL - 11
JO - Journal of Clinical Medicine
JF - Journal of Clinical Medicine
SN - 2077-0383
IS - 11
M1 - 2967
ER -