
Introducing and Refining Language in Vision Models

GIRELLA, FEDERICO
2026

Abstract

This thesis investigates how to extend the role of language within multimodal artificial intelligence, moving from its initial grounding in visual systems toward improved usability in representation, evaluation, and generative control. While vision-and-language models (VLMs) have achieved impressive capabilities, they still struggle with two key challenges: integrating language into domains that traditionally rely on purely visual or sensory information, and effectively leveraging the full expressive power of language, particularly when dealing with abstract concepts, compositional semantics, and user-driven creative tasks. The first part of this thesis focuses on introducing language into vision-centric systems. We enhance spatial perception through a Language-enhanced Renderable Neural Radiance Map (Le-RNR-Map), enabling natural language queries and affordance-based navigation within learned visual environments. In industrial inspection, we demonstrate how linguistic knowledge can guide anomaly detection and data generation via diffusion-based methods. Through text-guided defect synthesis and human-in-the-loop feedback, these contributions show that language can improve interpretability, robustness, and human-AI collaboration in practical, domain-specific systems. The second part addresses the challenge of improving the usability of language in multimodal models. We propose a training-free latent adaptation method that strengthens the representation of abstract language in VLMs, enabling better retrieval and semantic alignment. We then introduce a localized evaluation metric for text-to-image models, designed to assess fine-grained compositional correctness between entities and attributes. Finally, we develop a multimodal generation framework that combines textual and sketch-based conditioning for controllable, stepwise diffusion-based fashion design. 
Together, these contributions outline a trajectory from grounding linguistic meaning in visual tasks to enhancing its expressivity and operational value. The thesis demonstrates that language is not merely an annotation layer for visual data, but a powerful interface that enables richer reasoning, interaction, and creativity in multimodal AI systems.
Language: English
Supervisor: Marco Cristani
Pages: 140
File: Federico_Girella_PhD_Thesis.pdf (Adobe PDF, 27.05 MB) — open access; license: all rights reserved.

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/359859
The NBN code of this thesis is URN:NBN:IT:UNIVR-359859