Advances in Bayesian Methods: Nonparametric Network Modeling and Scalable Computations
BARILE, FRANCESCO
2025
Abstract
A fast-growing body of applications demands methods for analyzing increasingly complex structured data, and recent advances in measurement technologies have made such data available in ever larger quantities. The ubiquity of relational data across scientific fields is motivating the development of flexible statistical models in which a graph is the fundamental unit of observation, often under the assumption that all graphs are observed over the same entities, represented as labeled nodes. While much of the literature has focused on methods for analyzing a single network structure derived from the collected measurements, only a few contributions allow topological similarities among multiple observed graphs to be assessed. At the same time, analyzing large amounts of data, where storing the entire dataset in memory may not even be feasible, poses growing computational challenges, especially for complex structured data. This has driven recent interest in scalable algorithms that process data and learn from it in mini-batches. Moreover, because many real-world problems are inherently categorical, statistical models commonly work with distributions over categories and therefore involve latent compositional variables defined on the probability simplex. As a result, there is a growing need across fields for scalable learning mechanisms for these distributions that are efficient, accurate, and faster than traditional methods.
This thesis addresses these emerging needs with three main contributions, all developed from a Bayesian perspective. First, a novel Bayesian nonparametric mixture model for analyzing a set of graphical observations provides a flexible way to detect similarities among several measurements. The method is not limited to clustering: it can also be deployed for subsequent inferential tasks, such as network prediction and estimation of probability distributions for graphs. Second, a scalable Markov chain Monte Carlo algorithm is developed for efficient sampling from the probability simplex, with a focus on applications in large-scale Bayesian models. Because sampling from the probability simplex arises so widely, the proposed method is well suited to analyzing large-scale datasets with models that involve high-dimensional discrete probability distributions. Along the same lines, we also propose a scalable algorithm for a popular large-scale network model. A third, more applied contribution addresses the problem of clustering a set of graphical observations in a high-dimensional setting and therefore lies at the intersection of the first two. Exploiting the structure inherent to graphical observations, and aiming to analyze high-dimensional relational phenomena, we develop an approximate, scalable computational strategy based on similarities at the local level. Together, these contributions demonstrate the ability of Bayesian methodology to tackle complex modeling and computational problems from different methodological fields within an elegant, unified probabilistic framework.
File | Size | Format
---|---|---
phd_unimib_868624.pdf (open access) | 6.24 MB | Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/201104
URN:NBN:IT:UNIMIB-201104