Big Data Management

Molena, Alberto

The exponential growth of large and heterogeneous datasets has created new challenges for supervised Machine Learning, where scalability, robustness, and interpretability are increasingly required but not always achieved by traditional approaches. Existing methodologies often suffer from high computational costs, reliance on restrictive model assumptions, and limited applicability to mixed-type data, leaving important methodological questions open. This thesis addresses these issues by investigating two complementary directions: subsampling techniques for efficient data reduction, and clustering-based strategies as pre-processing tools for predictive modeling.

The first part focuses on subsampling methods, with particular attention to their limited support for heterogeneous data structures. To close this gap, the thesis introduces GS-Max, a novel model-agnostic algorithm that constructs space-filling subsets using the maximin criterion and the Gower distance. The method is deterministic and reproducible, and it yields representative subdata without relying on specific modeling assumptions. Simulation studies confirm that it outperforms both random subsampling and benchmark methods in geometric coverage and computational efficiency, while a simulated case study in sports analytics demonstrates that GS-Max can preserve a high level of the predictive accuracy of full-data training at a fraction of the computational cost.

The second part investigates the role of clustering as a pre-processing step for supervised learning, a field that remains underexplored despite the increasing availability of unsupervised techniques in industrial practice. The thesis proposes two integration frameworks: Membership Informed (MI), which augments the predictors with cluster membership values, and Weighted Ensemble Regression (WER), which combines clusterwise models through membership weighting. Their performance is evaluated in an industrial case study on Production Lead Time prediction, where fuzzy clustering consistently outperforms hard clustering and the fuzzy-MI configuration emerges as the most effective solution, offering accuracy gains together with efficiency and operational simplicity.

Taken together, the results demonstrate that principled subsampling and clustering-based pre-processing can substantially improve the scalability and interpretability of supervised pipelines in complex data environments. The contributions of the thesis therefore lie in extending the scope of subsampling methods to mixed-type data and in providing new frameworks for embedding clustering information into predictive models. Limitations remain, particularly high-dimensional distance concentration and the need for broader validation of clustering-enhanced strategies. Even so, the proposed methodologies open promising directions for future research, including projection-aware designs such as MaxPro and ensemble approaches to subsampling. Overall, the dissertation advances both the theoretical foundations and the practical tools required to design next-generation learning systems capable of adapting to the complexity, heterogeneity, and scale that increasingly characterize contemporary data science.
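The abstract names the ingredients of GS-Max (the maximin criterion and the Gower distance) but does not spell out the procedure. The following is therefore only a minimal sketch of one common way to combine the two, a greedy farthest-point selection, and should not be read as the thesis's algorithm. The function names `gower_to_point` and `greedy_maximin`, the fixed starting row, the first-index tie-breaking, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def gower_to_point(num, cat, ranges, i):
    """Gower distance from row i to every row (hypothetical helper).

    Numeric columns contribute range-scaled absolute differences,
    categorical columns contribute a 0/1 mismatch; the result is the
    unweighted average over all columns.
    """
    d_num = np.abs(num - num[i]) / ranges
    d_cat = (cat != cat[i]).astype(float)
    n_features = num.shape[1] + cat.shape[1]
    return (d_num.sum(axis=1) + d_cat.sum(axis=1)) / n_features

def greedy_maximin(num, cat, k, start=0):
    """Greedy maximin (farthest-point) subset of size k.

    Assumption: the first point is a fixed row index, which keeps the
    procedure deterministic; ties go to the lowest row index.
    """
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant columns then contribute zero distance
    selected = [start]
    # min_dist[j] = distance from row j to its nearest selected row
    min_dist = gower_to_point(num, cat, ranges, start)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))  # row farthest from the current subset
        selected.append(nxt)
        min_dist = np.minimum(min_dist, gower_to_point(num, cat, ranges, nxt))
    return selected

# Toy mixed-type data: one numeric and one categorical column (made up).
num = [[0.0], [1.0], [10.0], [9.0], [5.0]]
cat = [["a"], ["a"], ["b"], ["b"], ["a"]]
subset = greedy_maximin(num, cat, k=3)
print(subset)
```

Each step costs one pass over the data, so the sketch needs on the order of n·k distance evaluations and never builds the full n×n distance matrix. It assumes k does not exceed the number of distinct rows.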

2025

Degree programme: INGEGNERIA ECONOMICO GESTIONALE
Publication date: 3 Dec 2025
Language: English
Supervisor / Advisor / Tutor: ARBORETTI GIANCRISTOFARO, ROSA
Publisher: Università degli studi di Padova
Collection: Università degli Studi di Padova

Files in this item:

File	Size	Format
tesi_Alberto_Molena.pdf (embargoed until 2 Dec 2028; license: all rights reserved)	1.5 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/354272

The NBN code of this thesis is URN:NBN:IT:UNIPD-354272