Advances in density-based clustering for complex data

CORSINI, NOEMI
2025

Abstract

Density-based clustering has emerged as a powerful alternative to traditional clustering methods, which lack a strong statistical foundation and often rely on arbitrary choices of distance measure, especially when applied to data with complex structure. Unlike these methods, density-based clustering links groups to specific features of the underlying data distribution, offering a more rigorous answer to the ill-posed nature of the clustering problem. Two main approaches are usually distinguished. The parametric, or model-based, approach associates clusters with the components of a mixture model; it has seen broader development because of its ability to simplify even complex data structures and, in some cases, to reduce the computational burden. In contrast, the non-parametric, or modal, approach identifies clusters as regions of high density, aligning them with the modal regions of the data distribution without assuming a predefined model. While this offers greater flexibility, it has seen less progress in handling multidimensional and complex data. This thesis explores and extends both paradigms, comparing their applicability across different data types and highlighting their respective strengths and limitations. In particular, two main methodological contributions are presented. First, a novel method for clustering categorical and mixed-type data within the non-parametric framework is introduced. Second, the parametric approach to clustering network data with overlapping labels is extended within a Bayesian framework. Finally, a further contribution consists of applying the proposed methods to the Pantheon dataset, a comprehensive and original biographical database of historical figures, showing the effectiveness of these approaches in revealing meaningful patterns in complex data.
3 February 2025
English
MENARDI, GIOVANNA
Università degli studi di Padova
Files in this item:
File Size Format
Tesi_Definitiva_Noemi_Corsini.pdf

open access

Size 30.73 MB
Format Adobe PDF

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/193563
The NBN code of this thesis is URN:NBN:IT:UNIPD-193563