Active Evaluation for Generative AI: Towards an Adaptive Life-cycle

Betti, Federico
2026

Abstract

Generative Artificial Intelligence, particularly text-to-image models, has achieved remarkable progress, yet this rapid scaling has introduced critical bottlenecks in reliability, controllability, and computational sustainability. Current evaluation protocols, which typically rely on post-hoc global metrics such as LLM-based scores, are insufficient for diagnosing fine-grained failures or guiding resource-efficient generation. This thesis argues for an evaluation-centric lifecycle, in which evaluation mechanisms are embedded directly into the generative pipeline to actively monitor, verify, and refine model behaviour. Concretely, we instantiate this lifecycle with three methodological contributions that act at distinct stages of the image generation and editing process. Hallucination Early Detection (HEaD) monitors internal consistency during the diffusion process to anticipate semantic failures and enable early stopping or resampling, reducing unnecessary compute. Visual Concept Evaluation (ViCE) performs post-hoc verification by decomposing prompts into atomic visual concepts and using visual question answering to diagnose specific failures, producing interpretable, concept-level explanations of where and why a generation fails rather than a single aggregate score. These explanations enable practitioners to identify systematic weaknesses and to target improvements at specific visual concepts. Finally, addressing the editing domain, Differential Evaluation of Localised Image Edits (DICE) jointly detects what has changed under instruction-guided editing and assesses whether those changes align with the user's intent. Unified by three cross-cutting pillars of explainability, granularity, and sustainability, these contributions demonstrate that treating evaluation as a first-class component enables the development of generative systems that are more robust and better aligned with human intent. By moving from opaque global scores to structured, actionable feedback at the level of individual generations and edits, we establish a closed loop in which evaluation directly informs and improves the generative process.
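To make the concept-level verification idea concrete, the following minimal Python sketch shows one way a ViCE-style post-hoc check could be organised: decompose the prompt into atomic visual concepts, turn each concept into a yes/no question, and query a visual question answering model. The decomposition heuristic, the ConceptResult record, and the vqa callable are hypothetical illustrations under assumed interfaces, not the thesis's actual implementation.

# Minimal sketch of a ViCE-style concept-level evaluation loop.
# All names and heuristics here are hypothetical placeholders, not
# the thesis's actual implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ConceptResult:
    concept: str      # atomic visual concept extracted from the prompt
    question: str     # yes/no question used to verify the concept
    satisfied: bool   # whether the image passed the VQA check

def decompose_prompt(prompt: str) -> list[str]:
    # A real system would use an LLM or a parser; this toy heuristic
    # just splits on commas and "and" for illustration.
    parts = prompt.replace(" and ", ",").split(",")
    return [p.strip() for p in parts if p.strip()]

def evaluate_image(image, prompt: str,
                   vqa: Callable[[object, str], bool]) -> list[ConceptResult]:
    # Produce one pass/fail verdict per atomic concept instead of a
    # single aggregate score for the whole generation.
    results = []
    for concept in decompose_prompt(prompt):
        question = f"Does the image show {concept}?"
        results.append(ConceptResult(concept, question, vqa(image, question)))
    return results

# Example with a trivial stand-in for a real VQA model:
# report = evaluate_image(img, "a red car and a wooden bench",
#                         vqa=lambda image, q: True)

The per-concept verdicts can then be aggregated into the kind of interpretable report the abstract describes, pinpointing exactly which visual concepts a generation failed to satisfy.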
20 March 2026
English
Cornia, Marcella; Ballan, Lamberto
Sebe, Niculae
Università degli studi di Trento
TRENTO
150
Files in this item:

File: BettiFedericoThesisFinal_onlyThesis.pdf (open access)
License: Creative Commons
Size: 20.89 MB
Format: Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/363208
The NBN code of this thesis is URN:NBN:IT:UNITN-363208