Development and Application of Machine Learning Techniques for Text Analyses and Classification in Clinical Research (original Italian title: Sviluppo e applicazione di tecniche di apprendimento automatico per l'analisi e la classificazione del testo in ambito clinico)
LANERA CORRADO
Abstract
The content of Electronic Health Records (EHRs) is highly heterogeneous and depends on the overall structure of the health system. Free text is possibly the most abundant and most underused type of unstructured data in EHRs. With Machine Learning (ML), automatic models can now encode narratives with performance comparable to that of humans. This dissertation investigates ML Techniques (MLTs) for extracting insights from free text in clinical settings. We considered two main groups of free text involved in clinical research. The first comprises extensive documents such as research papers and study protocols. For this group, we considered 14 Systematic Reviews (SRs), including 7,494 studies from PubMed, and a full snapshot of 233,609 trials from ClinicalTrials.gov. The second group comprises pediatric EHRs, for which we considered two data sources: 6,903,035 visits from the Italian Pedianet database, and 2,723 Spanish discharge notes from the pediatric Emergency Departments (EDs) of nine hospitals in Nicaragua. The first contribution is an automatic system trained to replicate, in clinical registries, a search performed on specialized search engines. The proposed model showed very high classification performance (AUC from 93.4% to 99.9% across the 14 SRs), with the added value of retrieving fewer non-relevant studies (a mean of 472 and a maximum of 2,119 additional records, compared with 572 and 2,680, respectively, for the original manual extraction). A comparative study exploring the effect of varying the MLT or the method used to manage class imbalance is also reported. A full investigation of pediatric ED visits collected from nine hospitals in Nicaragua is reported as well; the mean accuracy in classifying discharge diagnoses was 78.31%, a promising result for the automatic ML classification of free-text ED discharge diagnoses in Spanish.
A further contribution aimed to improve the accuracy of infectious disease detection at the population level. This is a crucial public health issue, because accurate detection can provide the background information needed to implement effective control strategies, such as advertising vaccination campaigns and monitoring their effectiveness. Across the two studies reported, which classify cases of Varicella-Zoster Virus and types of otitis, both primary ML paradigms, shallow and deep models, were explored. In both cases the results were highly promising; in the latter, performance was comparable to that of humans (accuracy of 96.59% compared with 95.91% for human annotators, and balanced F1 score of 95.47% compared with 93.47%). A further relevant side goal concerns the languages investigated. International research on the use of MLTs to classify EHRs focuses mainly on English-language datasets. Hence, results on non-English databases, such as the Italian Pedianet or the Spanish ED visits considered in this dissertation, are essential for assessing the applicability of MLTs across languages. By showing performance comparable to that of humans, the dissertation highlights the real possibility of beginning to incorporate ML systems into daily clinical practice, producing a concrete improvement in health care processes wherever free text is involved.

| File | Size | Format |
|---|---|---|
| tesi_CORRADO_LANERA.pdf (type: other attached material; access only from BNCF and BNCR) | 3.81 MB | Adobe PDF |
Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.
https://hdl.handle.net/20.500.14242/359327
URN:NBN:IT:UNIPD-359327