Essays on Data Frameworks and Sustainable AI for Public Health
PRIOLA, MARIA PAOLA
2026
Abstract
This thesis explores data-centric and computational approaches to public health, integrating methods from data engineering, Artificial Intelligence (AI), and statistical evaluation. The overarching goal is to promote reliability, interpretability, and sustainability in the management and analysis of health data. The first chapter addresses the challenge of hallucinations in Large Language Models (LLMs). It presents a Retrieval Augmented Generation (RAG) framework grounded in external sources of knowledge and enhanced by domain-specific prompt engineering for healthcare. To evaluate reliability, the chapter introduces the Negative Missing Information Scoring System (NMISS), a system-level scoring scheme that extends standard metrics with contextual verification. Empirical tests on Italian healthcare-related news articles show how RAG and NMISS together improve the trustworthiness of LLM outputs. The second chapter introduces MEDITA (Multimodal hEalth Data lakehouse for ITAly), a lakehouse designed for Italian public health data. By integrating structured and unstructured sources through adaptive pipelines, MEDITA provides a unified environment for statistical analysis, forecasting, and interactive exploration. This proof of concept demonstrates the feasibility of a national-scale infrastructure that bridges the gap between raw data availability and actionable insights. The third chapter focuses on sustainability in machine learning, framed within the paradigm of Green AI. It delivers a comprehensive study of MultiClass Classification (MCC) strategies, systematically comparing accuracy, training time, and environmental impact. A dedicated evaluation pipeline monitors energy consumption and CO2 emissions. Results reveal that lightweight classifiers achieve competitive accuracy at a fraction of the cost of heavy models, underscoring the importance of balancing predictive performance with environmental responsibility.

| File | Size | Format | Access | License |
|---|---|---|---|---|
| mariapaolapriola_tesidottorato_rev.pdf | 4.5 MB | Adobe PDF | Embargoed until 10/03/2027 | All rights reserved |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/360613
URN:NBN:IT:UNICA-360613