Classification of cancer pathology reports with Deep Learning methods
2020
Abstract
Natural Language Processing (NLP) is a discipline concerned with the design of methods that process text. Deep learning, and Machine Learning (ML) in general, is the discipline that studies and implements methods that learn to make predictions from data. In recent years, many different ML methods have been presented in the context of NLP. In this work we focus in particular on text classification methods. Cancer registries collect pathology reports from clinical data sources and combine them with administrative data sources to identify cancer diagnoses in a specific area. Here we present a large-scale study of deep learning methods applied to cancer pathology reports written in Italian. In this study we developed several classifiers to predict topography and morphology ICD-O codes. We compared classic machine learning approaches, i.e. Support Vector Machine (SVM), with recent deep learning techniques, i.e. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). Furthermore, we compared recent attention-based and hierarchical techniques, e.g. Bidirectional Encoder Representations from Transformers (BERT), with a simpler hard attention method, showing that the latter is sufficient to perform slightly better in this specific domain.

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/153117
URN:NBN:IT:UNIFI-153117