Use of machine learning and data mining techniques to predict the onset of secondary progressive multiple sclerosis

MARIOSA, RAFFAELE
2025

Abstract

Multiple sclerosis (MS) is among the neurological diseases most thoroughly investigated with machine learning (ML) techniques. Nevertheless, despite extensive efforts and the growing availability of extensive, longitudinal, multimodal patient data in the big-data era, ML-based approaches have yet to achieve widespread adoption in routine clinical practice. These datasets are rich sources of information for training and validating ML models, but they also pose several significant challenges: ensuring data quality and addressing missing values, managing class imbalance, handling multimodal data, and ensuring interpretability to foster clinician trust. Indeed, complex ML models, notably deep neural networks and sophisticated ensemble methods, often work as opaque "black-box" systems. This lack of transparency makes clinicians reluctant to rely on them, particularly when high-stakes decisions must be made, for example pharmacological treatment escalations based on progression predictions. Thus, a significant gap remains before these models can be routinely integrated into clinical workflows. So-called optimal approaches (e.g., optimal decision trees) provide highly interpretable models, but they are computationally demanding and can therefore handle only relatively small datasets, which is at odds with the current availability of large collections of clinical data. Furthermore, to the best of our knowledge, the literature currently lacks an undersampling technique that is universally effective across datasets and can significantly reduce dataset size while minimizing information loss. The main contribution of this thesis is the development of an Auto-ML pipeline, particularly suitable for big-data scenarios, that comprehensively addresses all stages of the modeling process, from raw data to final predictions.
Specifically, leveraging data from the Italian Multiple Sclerosis Register (comprising records from approximately 80,000 patients and hundreds of thousands of clinical visits), we explored state-of-the-art methods and developed techniques that yield an interpretable classifier whose decision rules can be easily understood by clinicians and patients alike. The pipeline relies on a novel undersampling approach based on Support Vector Machines (SVMs), which selects the free support vectors to perform targeted undersampling while minimizing information loss. The underlying idea is that free support vectors, by enabling SVMs to achieve performance comparable to black-box ensembles, constitute a minimal yet highly informative set of samples for training effective classifiers. Consequently, optimal methods can be applied efficiently to the reduced dataset to generate interpretable classifiers whose performance matches that of more complex ensemble models trained on the original, larger dataset. To assess the proposed pipeline, we provide a comprehensive set of results supporting the validity of the approach.
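
The selection step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the standard soft-margin definition of a free support vector, namely a training sample whose dual coefficient alpha_i satisfies 0 < alpha_i < C, and it assumes the dual coefficients have already been obtained from a trained SVM. The function names, the tolerance `tol`, and the toy values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def free_support_vector_indices(
    alphas: Sequence[float], C: float, tol: float = 1e-8
) -> List[int]:
    """Return the indices i whose dual coefficient satisfies tol < alpha_i < C - tol.

    Assumption: alphas[i] is the non-negative dual coefficient of training
    sample i from an already trained soft-margin SVM with penalty C.
    (With scikit-learn's SVC, for instance, the absolute values of
    clf.dual_coef_ give these coefficients for the samples listed in
    clf.support_; every other sample has alpha_i = 0.)
    """
    if C <= 0:
        raise ValueError("C must be positive")
    return [i for i, a in enumerate(alphas) if tol < a < C - tol]


def undersample(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    alphas: Sequence[float],
    C: float,
    tol: float = 1e-8,
) -> Tuple[List[Sequence[float]], List[int]]:
    """Keep only the samples that are free support vectors."""
    keep = free_support_vector_indices(alphas, C, tol)
    return [samples[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy values: six samples, penalty C = 1.0.
# alpha = 0 -> not a support vector; alpha = C -> bounded support vector;
# 0 < alpha < C -> free support vector (the only ones retained).
alphas = [0.0, 0.3, 1.0, 0.75, 0.0, 1.0]
C = 1.0
samples = [[0.1], [0.9], [1.1], [1.4], [2.5], [1.0]]
labels = [-1, -1, 1, 1, 1, -1]

X_small, y_small = undersample(samples, labels, alphas, C)
print(X_small, y_small)
```

In a pipeline of the kind the abstract describes, the reduced set (`X_small`, `y_small` here) would then be passed to an interpretable learner such as an optimal decision tree. How the thesis tunes the SVM and handles class imbalance during this step is not stated in the abstract and is not modeled here.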
19 September 2025
English
PALAGI, Laura
GRASSI, Francesca
PALAGI, Laura
Università degli Studi di Roma "La Sapienza"
208
Files in this item:
Tesi_dottorato_Mariosa.pdf (open access, 18.77 MB, Adobe PDF)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/296457
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-296457