
Classification of cancer pathology reports with Deep Learning methods

2020

Abstract

Natural Language Processing (NLP) is a discipline concerned with the design of methods that process text. Deep learning, and Machine Learning (ML) in general, is the discipline that studies and implements methods that learn to make predictions from data. In recent years, many different ML methods have been proposed in the context of NLP. In this work we focus in particular on text classification methods. Cancer registries collect pathology reports from clinical data sources and combine them with administrative data sources to identify cancer diagnoses in a specific area. Here we present a large-scale study of deep learning methods applied to cancer pathology reports written in Italian. In this study we developed several classifiers to predict topography and morphology ICD-O codes. We compared classic machine learning approaches, i.e. Support Vector Machine (SVM), with recent deep learning techniques, i.e. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). Furthermore, we compared recent attention-based and hierarchical techniques, e.g. Bidirectional Encoder Representations from Transformers (BERT), with a simpler hard attention method, showing that the latter is sufficient to perform slightly better in this specific domain.
English
Paolo Frasconi
Università degli Studi di Firenze

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/153117
The NBN code of this thesis is URN:NBN:IT:UNIFI-153117