Bayesian Sparse Model for Complex Data
MASCARETTI, ANDREA
2024
Abstract
The increasing complexity of data has not, for the most part, changed the objectives of statistical modelling: the interpretation, communication and prediction of phenomena. It is therefore the models and the mathematical tools that must accommodate these new scenarios and make it possible to separate the signal from the noise. Both frequentist and Bayesian approaches can be devised for such problems. In this thesis, we present two concrete cases. The first is a linear regression setting in which traditional methods cannot effectively model the phenomenon: the case in which the dimension of the output is vastly greater than the dimension of the input. The envelope model (Cook, 2018) assumes that a linear combination of the output is unaffected by variations in the input; this assumption shrinks the variance of the estimators and thus extends a classical procedure to an otherwise unfriendly setting. We propose three innovations to the model, interpreted from a Bayesian point of view. The first is a fully Bayesian treatment of the model itself and of the dimension of the latent space of variables that are affected by changes in the input; the second is a mixture of such models, which extends the regression to cases in which hidden cluster structures may exist within the data and it is not possible to condition the models on them; the third is more computational in nature and focuses on the derivation of an approximate posterior distribution whose goal is to embed the model into a Euclidean space to ease computations. The second case we present is a network data setting. Structural dependencies among data can hinder inferential procedures, rendering them invalid or sub-optimal. We instead try to exploit the topology of the graph expressing the relations among the data, both to obtain desirable sparsity properties of the inferential procedures and to speed up computations.
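As a rough illustration of the envelope assumption, the following simulation is a minimal sketch in our own notation (dimensions, variable names and noise level are illustrative assumptions, not the thesis's construction): only a low-dimensional subspace of a high-dimensional response varies with the predictors, and the projection of the response onto the orthogonal complement of that subspace carries no regression signal.

```python
import numpy as np

rng = np.random.default_rng(0)
r, p, u, n = 50, 3, 2, 500  # response dim r >> predictor dim p; envelope dim u

# Semi-orthogonal basis Gamma of the envelope (the "material" part of Y).
Gamma, _ = np.linalg.qr(rng.standard_normal((r, u)))
eta = rng.standard_normal((u, p))   # coordinates of the regression in the envelope

X = rng.standard_normal((n, p))
Y = X @ (Gamma @ eta).T + 0.1 * rng.standard_normal((n, r))

# Project Y onto the orthogonal complement of Gamma: this "immaterial" part
# should be (approximately) uncorrelated with X.
P_comp = np.eye(r) - Gamma @ Gamma.T
immaterial = Y @ P_comp
cc = X.T @ immaterial / n  # empirical cross-covariance, (p, r)
print(np.abs(cc).max())    # close to zero up to Monte Carlo noise
```

The point of the sketch is that only the u-dimensional material part of the 50-dimensional response needs to be estimated against X, which is where the variance reduction comes from.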
To this aim, we extend the Horseshoe prior, a scale mixture of Gaussian distributions widely applied in Bayesian sparse regression, to this dependent case by incorporating dependence within the mixture through a graph Laplacian operator. We then derive a Gibbs sampler, linear in the number of nodes, to sample from the posterior. We show gains in reconstruction error on both simulated and real data, censoring uniformly at random a portion of each dataset we consider.

Cook, R. D. (2018). An Introduction to Envelopes: Dimension Reduction for Efficient Estimation in Multivariate Statistics. First edition. Wiley Series in Probability and Statistics. Hoboken, NJ: John Wiley & Sons. ISBN 978-1-119-42293-8.
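The graph-dependent Horseshoe idea can be sketched as follows. The classical Horseshoe draws an independent half-Cauchy local scale per coefficient; one way to inject graph dependence (an illustrative assumption on our part, not the thesis's exact prior) is to correlate the log local scales through a Gaussian whose precision is built from the graph Laplacian, so that neighbouring nodes share their degree of shrinkage.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6  # nodes of a path graph 1-2-...-6

# Adjacency and combinatorial Laplacian L = D - A of the path graph.
A = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
L = np.diag(A.sum(axis=1)) - A

# Classical Horseshoe: independent half-Cauchy local scales.
lam_indep = np.abs(rng.standard_cauchy(n))

# Graph-coupled variant (illustrative): Gaussian log-scales with precision
# L + eps*I, so the covariance ties neighbouring nodes together.
eps = 0.1  # ridge term making the precision positive definite
cov = np.linalg.inv(L + eps * np.eye(n))
log_lam = rng.multivariate_normal(np.zeros(n), cov)

tau = 1.0  # global scale, fixed here for simplicity
beta = rng.standard_normal(n) * tau * np.exp(log_lam)  # conditionally Gaussian draw
```

Conditionally on the scales, the coefficients remain Gaussian, which is what makes a Gibbs sampler with node-linear updates plausible; the specific log-Gaussian coupling above is only one way to realise the dependence described in the abstract.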
File | Size | Format | Access
---|---|---|---
tesi_definitiva_Andrea_Mascaretti.pdf | 1.41 MB | Adobe PDF | open access
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/160683
URN:NBN:IT:UNIPD-160683