Use of machine learning and data mining techniques to predict the onset of secondary progressive multiple sclerosis

Mariosa, Raffaele

Multiple sclerosis (MS) is among the neurological diseases most thoroughly investigated with machine learning (ML) techniques. Nevertheless, despite extensive efforts, ML-based approaches have yet to achieve widespread adoption in routine clinical practice. In the big-data era, extensive, longitudinal, multimodal patient datasets have become increasingly available. Although these datasets are rich sources of information for training and validating ML models, they raise several significant challenges: ensuring data quality and handling missing values, managing class imbalance, handling multimodal data, and ensuring interpretability to foster clinician trust. Complex ML models, notably deep neural networks and sophisticated ensemble methods, often operate as opaque "black-box" systems. This lack of transparency makes clinicians reluctant to rely on them, particularly when high-stakes decisions must be made, for example, escalating pharmacological treatment on the basis of progression predictions. A significant gap therefore remains before these models can be routinely integrated into clinical workflows. So-called optimal approaches (e.g., optimal decision trees) provide highly interpretable models, but they are computationally demanding and can handle only relatively small datasets, which is at odds with the large collections of clinical data now available. Furthermore, to the best of our knowledge, the literature currently lacks an undersampling technique that is effective across all datasets and can significantly reduce dataset size while minimizing information loss. The main contribution of this thesis is an Auto-ML pipeline, particularly suited to big-data scenarios, that addresses all stages of the modeling process, from raw data to final predictions.
Specifically, using data from the Italian Multiple Sclerosis Register (records from approximately 80,000 patients and hundreds of thousands of clinical visits), we explored state-of-the-art methods and developed techniques that yield an interpretable classifier whose decision rules can be easily understood by clinicians and patients alike. The pipeline relies on a novel undersampling approach based on Support Vector Machines (SVMs), which selects the free support vectors to perform targeted undersampling while minimizing information loss. The underlying idea is that free support vectors, which enable SVMs to achieve performance comparable to black-box ensembles, constitute a minimal yet highly informative set of samples for training effective classifiers. Consequently, optimal methods can be applied efficiently to the reduced dataset to generate interpretable classifiers whose performance matches that of more complex ensemble models trained on the original, larger dataset. We provide a comprehensive set of results to assess the performance of the proposed pipeline and support the validity of the approach.
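
As an illustration of the undersampling idea only, the sketch below filters a dataset down to its free support vectors. It is not the thesis's implementation, whose details the abstract does not give. It assumes the usual soft-margin SVM convention, in which a sample with dual multiplier alpha_i is a free support vector when 0 < alpha_i < C and a bounded one when alpha_i = C. The function names, the tolerance `tol`, and the toy multipliers are all hypothetical.

```python
from typing import List, Sequence, Tuple


def split_support_vectors(
    alphas: Sequence[float], C: float, tol: float = 1e-8
) -> Tuple[List[int], List[int]]:
    """Partition sample indices by their dual multiplier alpha_i.

    free    : tol < alpha_i < C - tol  (assumed to lie on the margin)
    bounded : alpha_i >= C - tol       (at the box constraint)
    Samples with alpha_i <= tol are not support vectors and go in neither list.
    """
    if C <= 0:
        raise ValueError("C must be positive")
    free: List[int] = []
    bounded: List[int] = []
    for i, a in enumerate(alphas):
        if a >= C - tol:
            bounded.append(i)
        elif a > tol:
            free.append(i)
    return free, bounded


def undersample(X, y, alphas, C, tol=1e-8):
    """Keep only the samples flagged as free support vectors."""
    free, _ = split_support_vectors(alphas, C, tol)
    return [X[i] for i in free], [y[i] for i in free]


# Toy data with made-up multipliers (hypothetical values, C = 1.0).
# In practice the multipliers would come from a trained SVM; with
# scikit-learn's SVC, for example, abs(dual_coef_) holds alpha_i for the
# samples indexed by support_.
X = [[0.0, 1.0], [1.0, 1.5], [2.0, 0.5], [3.0, 3.0], [4.0, 2.0]]
y = [-1, -1, 1, 1, 1]
alphas = [0.0, 0.4, 1.0, 0.4, 0.0]

X_small, y_small = undersample(X, y, alphas, C=1.0)
```

Under these assumptions, the reduced set `(X_small, y_small)` is what would be passed to a computationally demanding interpretable learner, such as an optimal decision tree, in place of the full dataset.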

2025

Faculty/Department: DIPARTIMENTO DI INGEGNERIA INFORMATICA, AUTOMATICA E GESTIONALE -ANTONIO RUBERTI-
Degree programme: Automatica, bioingegneria e ricerca operativa - Abro
Publication date: 19 Sep 2025
Language: English
Supervisor, Advisor, or Tutor: PALAGI, Laura; GRASSI, Francesca
Co-Supervisor, Second Reader, Co-Tutor, or Coordinators: PALAGI, Laura
Publisher: Università degli Studi di Roma "La Sapienza"
Number of pages: 208
Collection: Università degli Studi di Roma La Sapienza

Files in this item:

File	Size	Format
Tesi_dottorato_Mariosa.pdf (open access; license: all rights reserved)	18.77 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/296457

The NBN code of this thesis is URN:NBN:IT:UNIROMA1-296457