From Retrieval to Generation: Multimodal Models for Vision and Language Tasks
BALDRATI, ALBERTO
2025
Abstract
This thesis explores the integration of visual and textual data across various tasks, from image retrieval to image generation. To address discriminative and generative tasks, we examine two types of multimodal models: Vision-Language Models (VLMs) for retrieval and classification, and text-to-image diffusion models for generative problems. We begin by tackling the task of supervised Composed Image Retrieval (CIR), where the goal is to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. In this context, we propose a two-stage approach that adapts the CLIP model to CIR, achieving state-of-the-art results on the benchmark datasets FashionIQ and CIRR. Building on this, we then introduce a new task called Zero-Shot CIR (ZS-CIR), which aims to address CIR without requiring a labeled training dataset. The proposed method, iSEARLE (improved zeroShot composEd imAge Retrieval with textuaL invErsion), sets a new standard for ZS-CIR. Additionally, we support further research in this area by introducing the CIRCO dataset, the first CIR dataset to include multiple ground-truth images for each query, enabling a more comprehensive evaluation. Following this, we explore the few-shot adaptation of VLMs by introducing KDPL (Knowledge Distillation Prompt Learning), a parameter-efficient, unsupervised prompt learning method based on knowledge distillation. KDPL can be seamlessly integrated into existing prompt learning techniques, with experiments demonstrating significant improvements in zero-shot generalization across more than ten benchmark datasets. Next, the focus shifts to generative tasks in the fashion domain. We propose a latent diffusion model for multimodal fashion image editing, capable of generating realistic, human-centric fashion images from diverse inputs such as text descriptions, body poses, sketches, and fabric textures. 
To address the lack of appropriate datasets, we extend the Dress Code and VITON-HD datasets with multimodal annotations. Finally, we further explore generative models within the fashion domain by tackling the virtual try-on task. We introduce LaDI-VTON, the first diffusion-based model for this task, which employs textual inversion to preserve garment texture details during image generation, offering significant improvements over existing methods.
| File | Access | License | Size | Format | |
|---|---|---|---|---|---|
| AlbertoBaldratiPhDThesisFinalPdfa.pdf | open access | All rights reserved | 110.9 MB | Adobe PDF | View/Open |
| Final_Report_compiled.pdf | not available | All rights reserved | 583.79 kB | Adobe PDF | |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/216376
URN:NBN:IT:UNIPI-216376