Essays on Data Frameworks and Sustainable AI for Public Health

PRIOLA, MARIA PAOLA
2026

Abstract

This thesis explores data-centric and computational approaches to public health, integrating methods from data engineering, Artificial Intelligence (AI), and statistical evaluation. The overarching goal is to promote reliability, interpretability, and sustainability in the management and analysis of health data. The first chapter addresses the challenge of hallucinations in Large Language Models (LLMs). It presents a Retrieval-Augmented Generation (RAG) framework grounded in external knowledge sources and enhanced by domain-specific prompt engineering for healthcare. To evaluate reliability, it introduces the Negative Missing Information Scoring System (NMISS), a system-level scoring scheme that extends standard metrics with contextual verification. Empirical tests on Italian healthcare-related news articles show how RAG and NMISS together improve the trustworthiness of LLM outputs. The second chapter introduces the Multimodal hEalth Data lakehouse for ITAly (MEDITA), a lakehouse designed for Italian public health data. By integrating structured and unstructured sources through adaptive pipelines, MEDITA provides a unified environment for statistical analysis, forecasting, and interactive exploration. This proof of concept demonstrates the feasibility of a national-scale infrastructure that bridges the gap between raw data availability and actionable insights. The third chapter focuses on sustainability in machine learning, framed within the paradigm of Green AI. It delivers a comprehensive study of MultiClass Classification (MCC) strategies, systematically comparing accuracy, training time, and environmental impact. A dedicated evaluation pipeline monitors energy consumption and CO₂ emissions. Results reveal that lightweight classifiers achieve competitive accuracy at a fraction of the cost of heavy models, underscoring the importance of balancing predictive performance with environmental responsibility.
10 March 2026
English
CONVERSANO, CLAUDIO
ORTU, MARCO
Università degli Studi di Cagliari
Files in this item:

File: mariapaolapriola_tesidottorato_rev.pdf
Embargo: until 10 March 2027
License: All rights reserved
Size: 4.5 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/360613
The NBN code of this thesis is URN:NBN:IT:UNICA-360613