DATA GOVERNANCE FRAMEWORK FOR ML-BASED, DATA-INTENSIVE DISTRIBUTED SYSTEMS

Polimeno, Antongiacomo

The data processing landscape has been fundamentally transformed in recent decades, evolving from monolithic systems to complex distributed architectures in which multiple services, operated by different parties, collaborate to deliver sophisticated analytics capabilities. Machine Learning (ML) and Internet of Things (IoT) technologies are increasingly integrated into these systems, enabling advanced data processing pipelines that span multiple organizational boundaries. Modern distributed data processing systems support critical operations across various domains, from healthcare analytics to financial services, making their trustworthiness and their compliance with data protection regulations a primary concern.

This transformation has created significant challenges for data governance, particularly in balancing data quality preservation with privacy protection requirements. Current approaches suffer from fundamental limitations: existing governance frameworks focus primarily on static privacy configurations, neglecting the dynamic nature of distributed service environments and failing to optimize quality-privacy trade-offs systematically. Traditional access control models are inadequate for multi-dimensional optimization scenarios where data utility must be maximized while ensuring regulatory compliance. Despite increasing regulatory requirements and organizational needs for privacy-preserving data processing, there is a lack of governance frameworks that can systematically optimize quality-privacy trade-offs in modern distributed systems while maintaining computational tractability and industrial applicability.

In this thesis, we propose a data governance framework that addresses this gap. Our framework operates at the service composition layer, providing policy-driven optimization for data processing pipelines built as compositions of distributed services. It implements a multi-dimensional approach that balances data quality preservation with privacy protection through systematic service selection and dynamic adaptation mechanisms. The framework integrates with existing distributed data processing platforms and has been validated through industrial deployment and experimental evaluation.

Our contribution is manifold. First, we design and implement a policy-driven service selection framework. The framework defines three complementary quality metrics that capture different aspects of data utility preservation: a quantitative metric based on weighted Jaccard coefficients ($M_J$), a qualitative metric using Jensen-Shannon divergence ($M_{JSD}$), and an entropy-based metric ($M_H$) that quantifies information content retention. Second, we develop sliding window heuristic algorithms that address the NP-hard problem of optimal service selection by employing moving optimization windows that balance local and global objectives. Third, we provide a methodology for integrating our governance framework within existing distributed data processing platforms, demonstrating practical deployment through successful integration with the ALIDA (Advanced Laboratory for Interactive Data Analytics) platform. Finally, to validate our framework and demonstrate its industrial applicability, we conduct an experimental evaluation across multiple datasets and deployment scenarios, showing the framework's effectiveness in maintaining data quality while ensuring privacy compliance with acceptable operational overhead.
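For readers unfamiliar with the three quantities named in the abstract, the following is a minimal sketch of their textbook forms, applied to a column of categorical values before and after a privacy-preserving transformation. It is an illustration only: the abstract does not give the thesis's exact definitions of $M_J$, $M_{JSD}$ and $M_H$, so the function names, the use of value counts as Jaccard weights, the base-2 logarithm, and the ratio form of the entropy metric are all assumptions made here.

```python
import math
from collections import Counter


def weighted_jaccard(a, b):
    """Textbook weighted Jaccard: sum of minima over sum of maxima of weights."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 1.0


def distribution(values):
    """Empirical probability distribution of a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}


def shannon_entropy(p):
    """Shannon entropy in bits of a probability map."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)


def jensen_shannon_divergence(p, q):
    """JSD(P, Q) = H(M) - (H(P) + H(Q)) / 2, with M the mixture of P and Q."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    jsd = shannon_entropy(m) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))
    return max(0.0, jsd)  # guard against tiny negative rounding errors


def entropy_retention(original, transformed):
    """Assumed form of an entropy-based metric: ratio of entropies after/before."""
    h_before = shannon_entropy(distribution(original))
    h_after = shannon_entropy(distribution(transformed))
    return h_after / h_before if h_before > 0 else 1.0


# Hypothetical column before and after a masking service replaces rare values.
original = ["a", "a", "b", "c"]
masked = ["a", "a", "*", "*"]

m_j = weighted_jaccard(Counter(original), Counter(masked))
m_jsd = jensen_shannon_divergence(distribution(original), distribution(masked))
m_h = entropy_retention(original, masked)
print(m_j, m_jsd, m_h)
```

In this sketch a higher Jaccard value and a higher entropy ratio indicate better utility preservation, while a lower divergence indicates that the transformed distribution stays closer to the original; how the thesis normalizes or combines these values is not stated in the abstract.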


Faculty/Department: Dipartimento di Informatica Giovanni Degli Antoni
Degree programme: INFORMATICA (Computer Science)
Publication date: 5 December 2025
Language: English
Supervisor, Advisor or Tutor: ARDAGNA, CLAUDIO AGOSTINO
Co-advisor, Second Reader, Co-Supervisor, Co-Tutor or Coordinators: SASSI, ROBERTO
Publisher: Università degli Studi di Milano
Number of pages: 179
Collection: Università degli Studi di Milano

Files in this item:

File	Access	License	Size	Format
phd_unimi_R13812.pdf	Open access	All rights reserved	3.71 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/352541

The NBN code of this thesis is URN:NBN:IT:UNIMI-352541