Advances in unsupervised learning: integrating clustering and latent structure modeling
BOTTAZZI SCHENONE, MARIAELENA
2026
Abstract
In recent years, the explosive growth of data across many domains has underscored the need for advanced statistical methodologies capable of capturing latent structures, managing high dimensionality, and yielding interpretable insights. Traditional clustering and factor analysis techniques often prove inadequate in these contexts because of their limited flexibility, their sensitivity to noise and outliers, and their inability to uncover complex, overlapping, or hierarchical patterns. This thesis addresses these challenges by introducing novel multivariate methods focused on unsupervised learning, simultaneous clustering and dimensionality reduction, and latent structure modeling. The research conducted throughout this work has resulted in ten scientific papers, all of which are either published or currently under review in leading international journals. Among these, the five most significant contributions are discussed in detail in this thesis, while the other works (three already published, two under second-round review and one recently submitted) are briefly referenced to provide a comprehensive overview of the research outcomes. Each chapter of the thesis offers new methodological developments tailored to a distinct data structure or problem setting, while collectively reinforcing a broader vision: developing tools for flexible, interpretable, and robust analysis of high-dimensional data. The first two contributions investigate the connection between clustering and ranking of multivariate observations through Linear Ordered Partitions. Unlike traditional clustering, which partitions units into classes with no inherent order, the first model identifies clusters as equivalence classes ordered along a latent univariate dimension, thus yielding an optimal ranked partition of the data. The proposed method uses a constrained Factorial K-Means model.
Complementing this, a second, bootstrap-based methodology for clustering and ranking in a univariate context is also briefly introduced. Named Cluster Ranking via Bootstrap K-Means, it aligns with the Linear Ordered Partitions framework by constructing ranked equivalence classes for univariate data. This model identifies the maximum number of statistically distinct clusters: units within a cluster are considered equivalent, while differences are observed across clusters. Ranking is achieved through bootstrap confidence intervals for the K-Means centroids, enabling both estimation of the optimal number of clusters and ranking within and across equivalence classes. Building on the integration of latent structure and clustering, we introduce the Fuzzy Reduced K-Means. Unlike standard fuzzy clustering, this model incorporates dimensionality reduction into the clustering process, unveiling the latent dimensions that shape the data structure while permitting observations to belong to multiple clusters with varying degrees of membership. Such flexibility is crucial for capturing the overlapping and multifaceted nature of real-world phenomena. Expanding upon this idea, the Generalized Reduced K-Means model is presented. It extends the Reduced K-Means framework by allowing different clusters to be associated with distinct latent subspaces. This model is particularly suited to scenarios where the variables contribute differently to different subgroups. Attention then shifts to hierarchical modeling of latent structures. A foundational method for achieving such hierarchies is Structural Equation Modeling, used in a minor work to model complex interrelationships among multiple air pollutants and their determinants. Building on this, the Ultrametric Factor Analysis model is introduced. It reconstructs the correlation matrix of the manifest variables through a nested hierarchy of latent factors, forming an ultrametric tree.
Unlike traditional factor analysis, this approach uncovers a unique and interpretable latent hierarchy, offering deep insights into the underlying dimensions. Several works address the three-way setting. The Tucker3 model has been applied to socio-economic data and extended to a disjoint, interpretable version. Moreover, a fuzzy entropic Triple K-Means model has been proposed. A third work, briefly summarized in the Introduction, simultaneously integrates clustering and latent variable modeling: it softly partitions the occasion mode of a unit-by-variable-by-occasion array into K clusters, producing K consensus matrices. Each consensus matrix is analyzed with a Second-Order Disjoint Factor Analysis to extract first-order factors and a single General Factor, providing a compact, interpretable representation of the J variables within each cluster. The fourth work on three-way data, which is the one discussed in most detail in this thesis, explores clustering each of the three modes of a data array within a robustness framework. It introduces dimension-wise and cell-wise robust extensions of the Triple K-Means algorithm, allowing for outlier detection and trimming at various levels: entire units, variables, or occasions (dimension-wise), or individual data cells (cell-wise). These methods, based on a trimmed Alternating Least-Squares optimization framework, incorporate a data-driven selection mechanism for outlier detection using the elbow method applied to second-order derivatives. Collectively, these ten studies contribute a cohesive toolkit for unsupervised learning in high-dimensional, noisy, and complex data. The proposed approaches are broadly applicable across fields where complexity and multidimensionality are the norm, offering powerful solutions for extracting insights from intricate data landscapes.

| File | Access | License | Size | Format |
|---|---|---|---|---|
| Tesi_dottorato_BottazziSchenone.pdf | Open access | Creative Commons | 4.15 MB | Adobe PDF |
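The abstract mentions ranking univariate clusters through bootstrap confidence intervals for K-Means centroids. The following is only a minimal illustrative sketch of that general idea, not the procedure developed in the thesis: the function names (`kmeans_1d`, `bootstrap_centroid_cis`, `intervals_disjoint`), the quantile-based initialization, the percentile intervals, and the simulated data are all assumptions made for this example.

```python
import numpy as np

def kmeans_1d(x, k, n_iter=100):
    """Plain Lloyd iterations on univariate data; returns centroids sorted ascending."""
    # Assumed deterministic start: centroids at evenly spaced quantiles of x.
    centroids = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        # Assign each unit to its nearest centroid.
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        # Update centroids; an empty cluster keeps its previous centroid.
        new = np.array([
            x[labels == j].mean() if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break
        centroids = new
    return np.sort(centroids)

def bootstrap_centroid_cis(x, k, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap intervals for the k sorted centroids, shape (k, 2)."""
    rng = np.random.default_rng(seed)
    boot = np.empty((n_boot, k))
    for b in range(n_boot):
        # Resample units with replacement and refit; sorting the centroids
        # aligns clusters across replicates in the univariate case.
        sample = rng.choice(x, size=x.size, replace=True)
        boot[b] = kmeans_1d(sample, k)
    lower = np.quantile(boot, alpha / 2, axis=0)
    upper = np.quantile(boot, 1 - alpha / 2, axis=0)
    return np.column_stack([lower, upper])

def intervals_disjoint(cis):
    """True when every interval lies strictly above the preceding one."""
    return bool(np.all(cis[1:, 0] > cis[:-1, 1]))

# Simulated data (an assumption): three well-separated groups of 60 units.
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(m, 0.3, 60) for m in (0.0, 5.0, 10.0)])

cis = bootstrap_centroid_cis(x, k=3)
print(cis)                      # one [lower, upper] row per ranked cluster
print(intervals_disjoint(cis))  # whether the ranked clusters are separated
```

In this sketch, clusters whose centroid intervals do not overlap are treated as distinct and are ranked by centroid position. Repeating the check for increasing values of k would give one possible way to look for the largest number of separated clusters, although the criterion actually used in the thesis may differ.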
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/357524
URN:NBN:IT:UNIROMA1-357524