Introducing and Refining Language in Vision Models
GIRELLA, FEDERICO
2026
Abstract
This thesis investigates how to extend the role of language within multimodal artificial intelligence, moving from its initial grounding in visual systems toward improved usability in representation, evaluation, and generative control. While vision-and-language models (VLMs) have achieved impressive capabilities, they still struggle with two key challenges: integrating language into domains that traditionally rely on purely visual or sensory information, and effectively leveraging the full expressive power of language, particularly when dealing with abstract concepts, compositional semantics, and user-driven creative tasks. The first part of this thesis focuses on introducing language into vision-centric systems. We enhance spatial perception through a Language-enhanced Renderable Neural Radiance Map (Le-RNR-Map), enabling natural language queries and affordance-based navigation within learned visual environments. In industrial inspection, we demonstrate how linguistic knowledge can guide anomaly detection and data generation via diffusion-based methods. Through text-guided defect synthesis and human-in-the-loop feedback, these contributions show that language can improve interpretability, robustness, and human-AI collaboration in practical, domain-specific systems. The second part addresses the challenge of improving the usability of language in multimodal models. We propose a training-free latent adaptation method that strengthens the representation of abstract language in VLMs, enabling better retrieval and semantic alignment. We then introduce a localized evaluation metric for text-to-image models, designed to assess fine-grained compositional correctness between entities and attributes. Finally, we develop a multimodal generation framework that combines textual and sketch-based conditioning for controllable, stepwise diffusion-based fashion design. Together, these contributions outline a trajectory from grounding linguistic meaning in visual tasks to enhancing its expressivity and operational value. The thesis demonstrates that language is not merely an annotation layer for visual data, but a powerful interface that enables richer reasoning, interaction, and creativity in multimodal AI systems.
| File | Size | Format | License |
|---|---|---|---|
| Federico_Girella_PhD_Thesis.pdf | 27.05 MB | Adobe PDF | Open access; all rights reserved |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/359859
URN:NBN:IT:UNIVR-359859