Big Data Management

MOLENA, ALBERTO
2025

Abstract

The exponential growth of large and heterogeneous datasets has created new challenges for supervised Machine Learning, where scalability, robustness, and interpretability are increasingly required but not always achieved by traditional approaches. Existing methodologies often suffer from high computational costs, reliance on restrictive model assumptions, and limited applicability to mixed-type data, leaving important methodological questions open. This thesis addresses these issues by investigating two complementary directions: subsampling techniques for efficient data reduction, and clustering-based strategies as pre-processing tools for predictive modeling.

The first part focuses on subsampling methods, with particular attention to their limited support for heterogeneous data structures. To close this gap, the thesis introduces GS-Max, a novel model-agnostic algorithm that constructs space-filling subsets using the maximin criterion and the Gower distance. The method is deterministic and reproducible, and it yields representative subdata without relying on specific modeling assumptions. Simulation studies show that it outperforms both random subsampling and benchmark methods in geometric coverage and computational efficiency, while a simulated case study in sports analytics shows that GS-Max can preserve a high level of the predictive accuracy of full-data training at a fraction of the computational cost.

The second part investigates clustering as a pre-processing step for supervised learning, an area that remains underexplored despite the increasing availability of unsupervised techniques in industrial practice. The thesis proposes two integration frameworks: Membership Informed (MI), which augments the predictors with cluster membership values, and Weighted Ensemble Regression (WER), which combines clusterwise models through membership weighting. Both are evaluated in an industrial case study on Production Lead Time prediction, where fuzzy clustering consistently outperforms hard clustering and the fuzzy-MI configuration emerges as the most effective solution, offering accuracy gains together with efficiency and operational simplicity.

Taken together, the results show that principled subsampling and clustering-based pre-processing can substantially improve the scalability and interpretability of supervised pipelines in complex data environments. The contributions of the thesis therefore lie in extending subsampling methods to mixed-type data and in providing new frameworks for embedding clustering information into predictive models. Limitations remain, particularly high-dimensional distance concentration and the need for broader validation of clustering-enhanced strategies. Even so, the proposed methodologies open promising directions for future research, including projection-aware designs such as MaxPro and ensemble approaches to subsampling. Overall, the dissertation advances both the theoretical foundations and the practical tools needed to design learning systems that can adapt to the complexity, heterogeneity, and scale of contemporary data science.
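The abstract describes GS-Max only at a high level, so the following is not the thesis's algorithm. It is a minimal sketch of the two ingredients it names, a greedy maximin selection driven by the Gower distance, under stated assumptions: the function names `gower_distance` and `greedy_maximin_subset`, the fixed starting index, the lowest-index tie-breaking rule, and the toy data are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Union

Value = Union[float, int, str]


def gower_distance(a: Sequence[Value], b: Sequence[Value],
                   is_numeric: Sequence[bool],
                   ranges: Sequence[float]) -> float:
    """Gower distance: mean of per-feature dissimilarities, each in [0, 1]."""
    total = 0.0
    for x, y, numeric, rng in zip(a, b, is_numeric, ranges):
        if numeric:
            # Numeric feature: absolute difference scaled by the feature range.
            total += abs(x - y) / rng if rng > 0 else 0.0
        else:
            # Categorical feature: simple mismatch indicator.
            total += 0.0 if x == y else 1.0
    return total / len(a)


def greedy_maximin_subset(rows: Sequence[Sequence[Value]],
                          is_numeric: Sequence[bool],
                          k: int, start: int = 0) -> List[int]:
    """Greedily pick k row indices, each time adding the row that is
    farthest (in Gower distance) from the rows already selected.

    Assumption: starting from a fixed index and breaking ties by lowest
    index makes the procedure deterministic.
    """
    n = len(rows)
    # Observed range of every numeric column (unused for categorical ones).
    ranges = []
    for j, numeric in enumerate(is_numeric):
        if numeric:
            column = [row[j] for row in rows]
            ranges.append(max(column) - min(column))
        else:
            ranges.append(0.0)

    selected = [start]
    chosen = {start}
    # min_dist[i] = distance from row i to its nearest selected row.
    min_dist = [gower_distance(row, rows[start], is_numeric, ranges)
                for row in rows]

    while len(selected) < min(k, n):
        candidates = [i for i in range(n) if i not in chosen]
        # max() returns the first maximiser, i.e. the lowest index on ties.
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        chosen.add(nxt)
        for i, row in enumerate(rows):
            d = gower_distance(row, rows[nxt], is_numeric, ranges)
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy mixed-type data: one numeric and one categorical feature.
rows = [(0.0, "a"), (1.0, "a"), (10.0, "b"), (9.0, "b"), (5.0, "a")]
is_numeric = [True, False]
print(greedy_maximin_subset(rows, is_numeric, k=3))  # → [0, 2, 4]
```

On the toy data, the selection starts at row 0, jumps to the most dissimilar row (index 2), and then picks the row that best fills the remaining gap (index 4). This sketch recomputes distances naively at a cost of O(nk) distance evaluations; it makes no claim about how the thesis handles efficiency, missing values, or feature weighting.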
3 Dec 2025
English
ARBORETTI GIANCRISTOFARO, ROSA
Università degli studi di Padova
Files in this item:

tesi_Alberto_Molena.pdf
Embargoed until 2 December 2028
License: All rights reserved
Size: 1.5 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/354272
The NBN code of this thesis is URN:NBN:IT:UNIPD-354272