On Language as an Interface for Controlling Image Generation

Xu, Zipeng
2025

Abstract

The emergence of vision-language foundation models such as CLIP and DALL·E has significantly advanced the use of natural language for image generation and manipulation. These models learn broad alignments between visual and textual representations, enabling flexible, general-purpose multimodal capabilities. However, using language as an efficient and robust interface for image generation remains challenging, with limitations in controllability, semantic expressiveness, and visual fidelity. This thesis addresses several fundamental challenges in employing language as a control interface for image generation with foundation models. Specifically, it investigates (1) enhancing the controllability and precision of language-guided generation, (2) leveraging foundation models to explore and exploit generative latent spaces, and (3) developing a spectral perspective on CLIP embeddings to better analyze and improve generation quality. To address these challenges, we introduce three complementary approaches. Predict, Prevent, and Evaluate (PPE) enhances the controllability and precision of language-guided image manipulation by modeling and regularizing attribute interactions through natural language. StylerDALLE explores the latent space of generative models, formulating style transfer as translation between latent representations and supervising it with CLIP-based reinforcement learning to jointly preserve style and content. SpectralCLIP analyzes the frequency spectrum of CLIP embeddings to suppress common artifacts in CLIP-guided generation, improving robustness without compromising semantic alignment. Together, these contributions highlight the potential of natural language as a flexible, high-level interface for visual generation, grounded in the capabilities of vision-language foundation models, and demonstrate that language, when appropriately modeled and guided, can effectively control diverse aspects of the generative process.
Defense date: 9 December 2025
Language: English
Supervisor: Sebe, Niculae
University: Università degli studi di Trento, Trento
Pages: 116
Files in this item:
Phd_thesis (1).pdf (open access; license: all rights reserved), 54.93 MB, Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/352986
The NBN code of this thesis is URN:NBN:IT:UNITN-352986