Data Integration and Official Statistics, with a focus on Bayesian models for population size estimation

Tuoto, Tiziana

The integrated use and re-use of data from different sources is common practice in official statistics, and the international community recognizes it as a key element in the modernization of the statistical system. Data generated for non-statistical purposes can often be acquired easily and at low cost, so data integration reduces the cost of data collection and limits the statistical burden on respondents. In this research project, we developed three different aspects related to data integration activities in official statistics.

In Chapter 1 we considered the use of data from administrative archives to support survey data on a sensitive variable, income. This research was carried out in cooperation with Prof. Li-Chun Zhang, from Statistics Norway, the University of Southampton, and the University of Oslo, during his frequent visits to Rome, at the National Statistical Institute (Istat) and Sapienza University. We assumed that data linkage had been performed to combine administrative and survey data, with the aim of identifying and bringing together records from separate files that correspond to the same entities. Data linkage is usually not a trivial procedure, and linkage errors (false links and missed links) can affect standard statistical techniques and produce misleading inference. In this setting, we developed a regression model on integrated data for secondary analysis, where the linked data have been prepared by someone else and neither the match-key variables nor the unlinked records are available to the analyst. We also developed a diagnostic test for the assumption of non-informative linkage errors, which is required for our proposal as well as for all existing secondary-analysis adjustment methods. Compared to other adjustment methods, our approach has important advantages: it relies on the realistic assumption that the probability of correct linkage varies across records, but it does not assume that one can estimate that probability for each individual record. Moreover, it accommodates in a simple manner the general situation where the files are of different sizes and none of them is a subset of another. The adjusted regression model and the proposed test were studied by simulation and also applied to real data. The research illustrated in Chapter 1 has been published as an original article in the Journal of the Royal Statistical Society: Series A (Statistics in Society), Volume 184, Issue 2.

In Chapter 2 we dealt with a different data integration problem. We considered an additional re-use of an administrative register on prosecuted crimes to estimate the size of certain criminal populations, and in particular the number of people involved in criminal activities who, for some reason, are not reported to the justice system. In the capture-recapture framework of repeated count data, we focused on the identification and treatment of "one-inflation". This phenomenon occurs when the number of units captured exactly once clearly exceeds the number expected under a baseline count distribution. It has received increasing attention in the capture-recapture literature in recent years, since ignoring one-inflation has serious consequences for the estimation of the population size, which can be drastically overestimated. Criminal data might be particularly prone to one-inflation, since the people involved might develop an extreme form of the so-called "trap shy" behavioural model, i.e. the will and ability to avoid subsequent captures. In Chapter 2, in joint work with supervisors Prof. Andrea Tancredi and Dr. Davide Di Cecco, we proposed a Bayesian approach for one-inflated Poisson, Geometric and Negative Binomial count distributions. Posterior inference for the population size is obtained with a Gibbs sampler. We also provided a Bayesian approach to model selection. We illustrated the proposed methodology with simulated data and with real data, estimating the number of people implicated in the exploitation of prostitution in Italy. The research illustrated in Chapter 2 has been published as a research article in Biometrical Journal, Volume 64, Issue 5, in March 2022.
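To make the one-inflation effect described in the abstract concrete, the following minimal sketch computes expected capture frequencies under a Poisson baseline and shows how a simple estimator that ignores one-inflation overshoots the true population size. Everything in it is an illustrative assumption rather than the method of the thesis: the mechanism (a unit that would have been captured two or more times is instead observed exactly once with probability `omega`), the function names, and the parameter values are hypothetical, and Chao's lower-bound estimator is used only as a convenient non-Bayesian baseline in place of the Gibbs sampler approach.

```python
import math


def poisson_pmf(y, lam):
    """Probability of exactly y captures under a Poisson(lam) baseline."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


def expected_frequencies(N, lam, omega, max_count=30):
    """Expected number of units with 0..max_count captures.

    Assumed one-inflation mechanism (illustrative only): a unit whose
    baseline count is >= 2 is observed exactly once with probability omega,
    mimicking an extreme "trap shy" behaviour after the first capture.
    """
    base = [N * poisson_pmf(y, lam) for y in range(max_count + 1)]
    moved_to_one = omega * sum(base[2:])
    freq = list(base)
    freq[1] += moved_to_one
    for y in range(2, max_count + 1):
        freq[y] *= 1.0 - omega
    return freq


def chao_estimate(freq):
    """Chao's lower-bound estimator n + f1^2 / (2 f2), which ignores one-inflation."""
    n_observed = sum(freq[1:])  # units captured at least once
    return n_observed + freq[1] ** 2 / (2.0 * freq[2])


if __name__ == "__main__":
    true_N, lam = 1000, 1.0  # hypothetical values
    for omega in (0.0, 0.3):
        est = chao_estimate(expected_frequencies(true_N, lam, omega))
        print(f"omega={omega:.1f}  Chao estimate = {est:.1f}  (true N = {true_N})")
```

Under this assumed mechanism the number of observed units does not change, but the singletons grow at the expense of the higher counts, so an estimator that extrapolates the unseen units from the ratio of singletons to doubletons is pushed above the true value. With `omega = 0` the estimate coincides with the true size on expected frequencies.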

2022

Faculty/Department: DIPARTIMENTO DI METODI E MODELLI PER L'ECONOMIA, IL TERRITORIO E LA FINANZA
Degree programme: Metodi e modelli per l'economia e la finanza
Publication date: 27 September 2022
Language: English
Keywords: Official Statistics; data integration; data linkage; population size estimation; semi-parametric Bayesian models
Supervisor, Advisor or Tutor: TANCREDI, ANDREA; DI CECCO, DAVIDE
Publisher: Università degli Studi di Roma "La Sapienza"
Collection: Università degli Studi di Roma La Sapienza

Files in this item:

File	Size	Format
Tesi_dottorato_Tuoto.pdf (open access; license: all rights reserved)	1.51 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/100077

The NBN code of this thesis is URN:NBN:IT:UNIROMA1-100077