Language models for information quality: methods and applications

Mathew, Jerin George

Ensuring high-quality information is fundamental to modern data-driven decision-making systems. This thesis explores the role of language models (LMs) and large language models (LLMs) in enhancing information quality (IQ), spanning tasks such as data cleaning, uncertainty estimation, on-demand data retrieval, and fairness in subjective data ranking. The first part of this work focuses on data cleaning, particularly entity resolution (ER) and entity count estimation, proposing a framework that integrates machine learning, clustering, and statistical approaches to efficiently estimate the number of distinct entities in large datasets. A sampling-based pipeline is introduced to improve scalability without compromising accuracy. The second part investigates uncertainty estimation in LLM-generated responses, proposing a Bayesian crowdsourcing framework to assess and aggregate outputs from multiple models. This enables more reliable decision-making by quantifying the confidence in generated information. Furthermore, this thesis explores the use of LLMs for automating structured data retrieval from heterogeneous sources, demonstrating their effectiveness in industrial applications where real-time insights are required. Finally, the thesis addresses ethical data quality, with a particular focus on fairness in ranking systems that rely on subjective data. A fairness assessment pipeline is introduced to measure exposure disparities across different groups in collaborative rating platforms. The proposed methodology quantifies both item-level and query-level fairness, ensuring balanced representation in ranked outputs. Through a combination of machine learning, Bayesian inference, and LLM-based techniques, this thesis advances the state of the art in ensuring reliability, fairness, and efficiency in data-driven applications. 
The proposed methodologies are validated through extensive experiments on real-world datasets, offering practical solutions for improving information quality across diverse domains.

2025


Faculty/Department:	DIPARTIMENTO DI INGEGNERIA INFORMATICA, AUTOMATICA E GESTIONALE -ANTONIO RUBERTI-
Degree programme:	Other doctoral programme
Publication date:	23 Jan 2025
Language:	English
Supervisor, Advisor, or Tutor:	FIRMANI, DONATELLA; MECELLA, Massimo
Co-supervisor, Examiner, Co-tutor, or Coordinators:	LENZERINI, Maurizio
Publisher:	Università degli Studi di Roma "La Sapienza"
Collection:	Università degli Studi di Roma La Sapienza

Files in this item:

File	Access	License	Size	Format
Tesi_dottorato_Mathew.pdf	Open access	All rights reserved	3.57 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/190309

The NBN code of this thesis is URN:NBN:IT:UNIROMA1-190309