Bayesian Sparse Model for Complex Data
MASCARETTI, ANDREA
2024
Abstract
The increasing complexity of data has not, for the most part, changed the objectives of statistical modelling: the interpretation, communication and prediction of phenomena. It is therefore the models and the mathematical tools that must accommodate these new scenarios and make it possible to separate the signal from the noise. Both frequentist and Bayesian approaches can be devised for such problems. In this thesis, we present two concrete cases. The first is a linear regression setting in which traditional methods cannot effectively model the phenomenon: the case in which the dimension of the output is vastly greater than the dimension of the input. The envelope model (Cook, 2018) assumes that a linear combination of the output is unaffected by variations in the input; this assumption shrinks the variance of the estimators and thus extends a classical procedure to an otherwise unfriendly setting. We propose three innovations to the model, interpreted from a Bayesian point of view. The first is a fully Bayesian treatment of the model itself and of the dimension of the latent space of variables that are affected by changes in the input; the second is a mixture of such models, which extends the regression to cases in which hidden cluster structures may exist within the data and it is not possible to condition the models on them; the third is more computational in nature and focuses on the derivation of an approximate posterior distribution whose goal is to embed the model into a Euclidean space to ease computations. The second case we present is a network data setting. Structural dependencies among data can hinder inferential procedures, rendering them invalid or sub-optimal. We instead try to exploit the topology of the graph expressing the relations among the data, both to obtain desirable sparsity properties of the inferential procedures and to speed up computations.
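As a rough illustration of the envelope assumption, the following simulation is a minimal sketch in our own notation (dimensions, variable names and noise level are illustrative assumptions, not the thesis's construction): only a low-dimensional subspace of a high-dimensional response varies with the predictors, and the projection of the response onto the orthogonal complement of that subspace carries no regression signal.

```python
import numpy as np

rng = np.random.default_rng(0)
r, p, u, n = 50, 3, 2, 500  # response dim r >> predictor dim p; envelope dim u

# Semi-orthogonal basis Gamma of the envelope (the "material" part of Y).
Gamma, _ = np.linalg.qr(rng.standard_normal((r, u)))
eta = rng.standard_normal((u, p))   # coordinates of the regression in the envelope

X = rng.standard_normal((n, p))
Y = X @ (Gamma @ eta).T + 0.1 * rng.standard_normal((n, r))

# Project Y onto the orthogonal complement of Gamma: this "immaterial" part
# should be (approximately) uncorrelated with X.
P_comp = np.eye(r) - Gamma @ Gamma.T
immaterial = Y @ P_comp
cc = X.T @ immaterial / n  # empirical cross-covariance, (p, r)
print(np.abs(cc).max())    # close to zero up to Monte Carlo noise
```

The point of the sketch is that only the u-dimensional material part of the 50-dimensional response needs to be estimated against X, which is where the variance reduction comes from.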
To this aim, we extend the Horseshoe prior, a scale mixture of Gaussian distributions widely applied in Bayesian sparse regression, to this dependent case by incorporating dependence within the mixture through a graph Laplacian operator. We then derive a Gibbs sampler, linear in the number of nodes, to sample from the posterior. We show gains in reconstruction error on both simulated and real data, censoring uniformly at random a portion of each dataset we consider.

Cook, R. D. (2018). An Introduction to Envelopes: Dimension Reduction for Efficient Estimation in Multivariate Statistics. First edition. Wiley Series in Probability and Statistics. Hoboken, NJ: John Wiley & Sons. ISBN 978-1-119-42293-8.
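The graph-dependent Horseshoe idea can be sketched as follows. The classical Horseshoe draws an independent half-Cauchy local scale per coefficient; one way to inject graph dependence (an illustrative assumption on our part, not the thesis's exact prior) is to correlate the log local scales through a Gaussian whose precision is built from the graph Laplacian, so that neighbouring nodes share their degree of shrinkage.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6  # nodes of a path graph 1-2-...-6

# Adjacency and combinatorial Laplacian L = D - A of the path graph.
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(A.sum(axis=1)) - A

# Classical Horseshoe: independent half-Cauchy local scales.
lam_indep = np.abs(rng.standard_cauchy(n))

# Graph-coupled variant (illustrative): Gaussian log-scales with precision
# L + eps*I, so the covariance ties neighbouring nodes together.
eps = 0.1  # ridge term making the precision positive definite
cov = np.linalg.inv(L + eps * np.eye(n))
log_lam = rng.multivariate_normal(np.zeros(n), cov)

tau = 1.0  # global scale, fixed here for simplicity
beta = rng.standard_normal(n) * tau * np.exp(log_lam)  # conditionally Gaussian draw
```

Conditionally on the scales, the coefficients remain Gaussian, which is what makes a Gibbs sampler with node-linear updates plausible; the specific log-Gaussian coupling above is only one way to realise the dependence described in the abstract.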
File | Size | Format | Access
---|---|---|---
tesi_definitiva_Andrea_Mascaretti.pdf | 1.41 MB | Adobe PDF | open access
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/160683
URN:NBN:IT:UNIPD-160683