Automatic Coding of Clinical Documents: Leveraging Symbolic and Sub-Symbolic Approaches for Enhanced Interpretability and Explainability
POPESCU, MIHAI HORIA
2025
Abstract
The healthcare industry is experiencing an unprecedented surge in the volume of medical data generated every day. This growth is driven by several factors, including the widespread adoption of electronic health records and applications that continuously generate streams of patient data. Medical data now spans a diverse array of structured and unstructured formats. Because a large share of it exists as unstructured free text, data processing systems struggle to use this information effectively. Assigning standardized meanings to these textual expressions is therefore crucial in epidemiology, statistics, and health informatics: it enables automated processing and analysis of data, provides the information needed to make informed decisions, and facilitates the implementation of effective public health policies. In the medical field, this is typically achieved by coding and classifying text using appropriate terminologies and classifications. Traditionally, and still in many current settings, this process is performed manually by trained professionals; even when executed meticulously, it is laborious, time-consuming, and prone to human error. To assist practitioners, two main families of techniques have been used to improve the coding process: symbolic and sub-symbolic. Symbolic techniques employ logical reasoning and human-defined rules to produce deterministic outcomes, while sub-symbolic techniques employ machine learning and neural networks to handle large, noisy datasets and to learn adaptively from patterns within the data. This work aims to support the automation of clinical coding through symbolic and sub-symbolic approaches and to enhance the interpretability of sub-symbolic methods.
First, we presented the architecture of a novel rule-based system for the automated selection of the Underlying Cause of Death (UCOD), using classification-independent rules. This was followed by a preliminary validation on two datasets of death certificates coded with the 10th (ICD-10) and 11th (ICD-11) revisions of the International Statistical Classification of Diseases and Related Health Problems. Second, we addressed the coding of death certificates using sub-symbolic approaches, focusing on converting textual conditions into ICD-10 codes and identifying the UCOD. We proposed a novel method, based on natural language processing algorithms, that outperformed state-of-the-art systems for UCOD selection from death certificates. In particular, we compared various techniques applied to tabular and textual data, including logistic regression, random forest, XGBoost, feedforward neural networks with categorical embeddings, and transformers. Through extensive comparative experiments, we found that the fine-tuned Mistral model significantly outperformed other transformer models, particularly in data-limited scenarios. Third, we aimed to enhance the interpretability of deep learning models by ensuring proper calibration, and we introduced mechanisms to rank instances by difficulty using the variance of gradients. Furthermore, we proposed a model based on text-to-text transformers that generated human-readable explanations, learning from the rule-based system to support the explainability of deep learning models. Finally, we trained a disease-related language model by creating a pre-training corpus based on ICD-11, and we tackled the challenge of linking clinical notes to SNOMED CT by leveraging Large Language Models and Retrieval-Augmented Generation.
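To make the idea of ranking instances by difficulty with the variance of gradients more concrete, the following is a minimal sketch and not the thesis's implementation. It assumes that a gradient vector has already been computed for every instance at several training checkpoints; the function names `vog_scores` and `rank_by_difficulty`, the data layout, and the toy numbers are all hypothetical. Refinements such as per-class normalization of the scores are omitted.

```python
from statistics import mean, pvariance


def vog_scores(grads_per_checkpoint):
    """Return one variance-of-gradients score per instance.

    grads_per_checkpoint[c][i] is the (hypothetical, precomputed) gradient
    vector of instance i at training checkpoint c, given as a list of floats.
    """
    n_instances = len(grads_per_checkpoint[0])
    scores = []
    for i in range(n_instances):
        # Gradient vectors of instance i, one per checkpoint.
        per_checkpoint = [ckpt[i] for ckpt in grads_per_checkpoint]
        # Variance across checkpoints for each feature, averaged over features.
        feature_vars = [pvariance(values) for values in zip(*per_checkpoint)]
        scores.append(mean(feature_vars))
    return scores


def rank_by_difficulty(grads_per_checkpoint):
    """Return instance indices ordered from highest to lowest score."""
    scores = vog_scores(grads_per_checkpoint)
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy data (assumed): 3 checkpoints, 2 instances, 2 features each.
toy_grads = [
    [[0.1, 0.1], [1.0, -1.0]],
    [[0.1, 0.1], [-1.0, 1.0]],
    [[0.1, 0.1], [0.0, 0.0]],
]
scores = vog_scores(toy_grads)
ranking = rank_by_difficulty(toy_grads)
```

In this toy input the gradients of the first instance never change across checkpoints, so its score is zero, while the gradients of the second instance fluctuate and it is ranked as the more difficult of the two.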
File | Size | Format |
---|---|---|
tesi.pdf (open access) | 5.5 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/215118
URN:NBN:IT:UNIUD-215118