DATA GOVERNANCE FRAMEWORK FOR ML-BASED, DATA-INTENSIVE DISTRIBUTED SYSTEMS
POLIMENO, ANTONGIACOMO
2025
Abstract
Over recent decades, data processing has shifted from monolithic systems to complex distributed architectures in which multiple services, operated by different parties, collaborate to deliver sophisticated analytics capabilities. Machine Learning (ML) and Internet of Things (IoT) technologies are increasingly integrated into these systems, enabling advanced data processing pipelines that span multiple organizational boundaries. Modern distributed data processing systems support critical operations across domains ranging from healthcare analytics to financial services, which makes their trustworthiness and compliance with data protection regulations a central concern. This transformation has created significant challenges for data governance, particularly in balancing data quality preservation against privacy protection requirements. Current approaches suffer from fundamental limitations: existing governance frameworks focus primarily on static privacy configurations, neglect the dynamic nature of distributed service environments, and fail to optimize quality-privacy trade-offs systematically. Traditional access control models are inadequate for multi-dimensional optimization scenarios in which data utility must be maximized while ensuring regulatory compliance. Despite growing regulatory requirements and organizational needs for privacy-preserving data processing, there is a lack of governance frameworks that can systematically optimize quality-privacy trade-offs in modern distributed systems while remaining computationally tractable and industrially applicable. In this thesis, we propose a data governance framework that addresses this gap. Our framework operates at the service composition layer, providing policy-driven optimization for data processing pipelines built as compositions of distributed services.
It implements a multi-dimensional approach that balances data quality preservation with privacy protection through systematic service selection and dynamic adaptation mechanisms. The framework integrates with existing distributed data processing platforms and has been validated through industrial deployment and experimental evaluation. Our contributions are as follows. First, we design and implement a policy-driven service selection framework that defines three complementary quality metrics, each capturing a different aspect of data utility preservation: a quantitative metric based on weighted Jaccard coefficients ($M_J$), a qualitative metric using Jensen-Shannon divergence ($M_{JSD}$), and an entropy-based metric ($M_H$) that quantifies information content retention. Second, we develop sliding-window heuristic algorithms that address the NP-hard problem of optimal service selection by employing moving optimization windows that balance local and global objectives. Third, we provide a methodology for integrating our governance framework within existing distributed data processing platforms, demonstrating practical deployment through integration with the ALIDA (Advanced Laboratory for Interactive Data Analytics) platform. Finally, to validate our framework and demonstrate its industrial applicability, we conduct an experimental evaluation across multiple datasets and deployment scenarios, showing that the framework maintains data quality while ensuring privacy compliance with acceptable operational overhead.

| File | Size | Format |
|---|---|---|
| phd_unimi_R13812.pdf (open access; license: all rights reserved) | 3.71 MB | Adobe PDF |
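The abstract names three quality metrics but does not give their formulas. As a rough illustration only, the sketch below implements the standard textbook quantities those names refer to: weighted Jaccard similarity, Jensen-Shannon divergence, and Shannon entropy. The function names, the use of value-frequency counts as input, and the `entropy_retention` ratio are assumptions made for this example and are not taken from the thesis, whose definitions of $M_J$, $M_{JSD}$, and $M_H$ may differ.

```python
import math
from collections import Counter


def weighted_jaccard(x, y):
    """Standard weighted Jaccard similarity: sum of mins over sum of maxes."""
    keys = set(x) | set(y)
    num = sum(min(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    den = sum(max(x.get(k, 0.0), y.get(k, 0.0)) for k in keys)
    return num / den if den > 0 else 1.0  # two empty inputs count as identical


def normalize(counts):
    """Turn non-negative counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def shannon_entropy(p):
    """Shannon entropy in bits of a distribution given as {value: probability}."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def jensen_shannon_divergence(p, q):
    """JSD(P, Q) = H(M) - (H(P) + H(Q)) / 2 with M = (P + Q) / 2, in bits."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return shannon_entropy(m) - 0.5 * (shannon_entropy(p) + shannon_entropy(q))


def entropy_retention(original, transformed):
    """Hypothetical retention ratio H(transformed) / H(original); an assumption."""
    h_orig = shannon_entropy(original)
    return shannon_entropy(transformed) / h_orig if h_orig > 0 else 1.0


# Toy column before and after a privacy transformation that merges "c" into "b".
before = Counter(["a", "a", "b", "c"])
after = Counter(["a", "a", "b", "b"])

print("weighted Jaccard:", weighted_jaccard(before, after))
print("JS divergence:", jensen_shannon_divergence(normalize(before), normalize(after)))
print("entropy retention:", entropy_retention(normalize(before), normalize(after)))
```

With base-2 logarithms the Jensen-Shannon divergence lies in [0, 1], so all three quantities in this sketch share a common scale, which is convenient when comparing candidate services against one another.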
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/352541
URN:NBN:IT:UNIMI-352541