Active Evaluation for Generative AI: Towards an Adaptive Life-cycle
Betti, Federico
2026
Abstract
Generative Artificial Intelligence, particularly text-to-image models, has achieved remarkable progress, yet this rapid scaling has introduced critical bottlenecks in reliability, controllability, and computational sustainability. Current evaluation protocols, typically relying on post-hoc global metrics such as LLM-based scores, are insufficient for diagnosing fine-grained failures or guiding resource-efficient generation. This thesis argues for an evaluation-centric lifecycle, in which evaluation mechanisms are embedded directly into the generative pipeline to actively monitor, verify, and refine model behaviour. Concretely, we instantiate this lifecycle with three methodological contributions that act at distinct stages of the image generation and editing process. Hallucination Early Detection (HEaD) monitors internal consistency during the diffusion process to anticipate semantic failures and enable early stopping or resampling, reducing unnecessary compute. Visual Concept Evaluation (ViCE) performs post-hoc, concept-level verification by decomposing prompts into atomic visual concepts and using visual question answering to diagnose specific failures, producing interpretable, concept-level explanations of where and why a generation fails, rather than a single aggregate score. These explanations enable practitioners to identify systematic weaknesses and to target improvements at specific visual concepts. Finally, addressing the editing domain, Differential Evaluation of Localised Image Edits (DICE) jointly detects what has changed under instruction-guided editing and assesses whether those changes align with the user's intent. Unified by three cross-cutting pillars of explainability, granularity, and sustainability, these contributions demonstrate that treating evaluation as a first-class component enables the development of generative systems that are more robust and aligned with human intent.
By moving from opaque global scores to structured, actionable feedback at the level of individual generations and edits, we establish a closed loop in which evaluation directly informs and improves the generative process.
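To make the ViCE idea sketched above concrete, the following toy example shows the general shape of concept-level verification: a prompt is decomposed into atomic visual concepts, each concept becomes a yes/no question, and a per-concept report is returned instead of one aggregate score. This is a minimal illustration, not the thesis implementation; `decompose_prompt` and `vqa_stub` are hypothetical placeholders, where a real pipeline would use an LLM for decomposition and a vision-language model to answer the questions.

```python
# Illustrative ViCE-style sketch (hypothetical, not the thesis code).
def decompose_prompt(prompt):
    """Split a prompt into atomic visual concepts (toy heuristic)."""
    return [c.strip() for c in prompt.split(" and ")]

def to_question(concept):
    """Turn a concept into a yes/no verification question."""
    return f"Does the image show {concept}?"

def vqa_stub(image, question):
    """Stand-in for a real VQA model; here it 'fails' on one concept."""
    return "a blue ball" not in question

def evaluate(image, prompt):
    """Return a per-concept verdict rather than a single global score."""
    return {concept: vqa_stub(image, to_question(concept))
            for concept in decompose_prompt(prompt)}

report = evaluate(image=None, prompt="a red cube and a blue ball")
# The report exposes *which* part of the prompt failed, enabling
# targeted resampling or editing rather than blind regeneration.
print(report)
```

The point of the structure, rather than the stubs, is that the output is interpretable: each key names a concept and each value is a verdict, which is the kind of actionable feedback the abstract contrasts with opaque global scores.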
File: BettiFedericoThesisFinal_onlyThesis.pdf (open access; license: Creative Commons; size: 20.89 MB; format: Adobe PDF)
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/363208
URN:NBN:IT:UNITN-363208