
From Retrieval to Generation: Multimodal Models for Vision and Language Tasks

BALDRATI, ALBERTO
2025

Abstract

This thesis explores the integration of visual and textual data across various tasks, from image retrieval to image generation. To address discriminative and generative tasks, we examine two types of multimodal models: Vision-Language Models (VLMs) for retrieval and classification, and text-to-image diffusion models for generative problems. We begin by tackling the task of supervised Composed Image Retrieval (CIR), where the goal is to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. In this context, we propose a two-stage approach that adapts the CLIP model to CIR, achieving state-of-the-art results on the benchmark datasets FashionIQ and CIRR. Building on this, we then introduce a new task called Zero-Shot CIR (ZS-CIR), which aims to address CIR without requiring a labeled training dataset. The proposed method, iSEARLE (improved zeroShot composEd imAge Retrieval with textuaL invErsion), sets a new standard for ZS-CIR. Additionally, we support further research in this area by introducing the CIRCO dataset, the first CIR dataset to include multiple ground-truth images for each query, enabling a more comprehensive evaluation. Following this, we explore the few-shot adaptation of VLMs by introducing KDPL (Knowledge Distillation Prompt Learning), a parameter-efficient, unsupervised prompt learning method based on knowledge distillation. KDPL can be seamlessly integrated into existing prompt learning techniques, with experiments demonstrating significant improvements in zero-shot generalization across more than ten benchmark datasets. Next, the focus shifts to generative tasks in the fashion domain. We propose a latent diffusion model for multimodal fashion image editing, capable of generating realistic, human-centric fashion images from diverse inputs such as text descriptions, body poses, sketches, and fabric textures. 
To address the lack of appropriate datasets, we extend the Dress Code and VITON-HD datasets with multimodal annotations. Finally, we further explore generative models within the fashion domain by tackling the virtual try-on task. We introduce LaDI-VTON, the first diffusion-based model for this task, which employs textual inversion to preserve garment texture details during image generation, offering significant improvements over existing methods.
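To make the CIR task concrete, the sketch below shows a simple late-fusion baseline (not the thesis's actual two-stage or textual-inversion method): the reference-image embedding and the relative-caption embedding are summed and the gallery is ranked by cosine similarity. Random vectors stand in for real CLIP features, and the target row is planted by hand; all names here are illustrative.

```python
import numpy as np

def normalize(x):
    # L2-normalize rows so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def compose_query(ref_img_feat, caption_feat):
    # Late fusion: sum the normalized image and text features,
    # then re-normalize the composed query vector
    return normalize(normalize(ref_img_feat) + normalize(caption_feat))

def retrieve(query, gallery_feats, k=3):
    # Rank gallery images by cosine similarity to the composed query
    sims = normalize(gallery_feats) @ query
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
dim = 512                              # CLIP ViT-B/32 embedding size
ref = rng.normal(size=dim)             # stand-in for the reference-image feature
cap = rng.normal(size=dim)             # stand-in for the relative-caption feature
gallery = rng.normal(size=(100, dim))  # stand-in gallery features
gallery[42] = 0.5 * ref + 0.5 * cap    # plant a "target" close to the composition

query = compose_query(ref, cap)
print(retrieve(query, gallery))        # index 42 should rank first
```

In practice the thesis goes well beyond this baseline (task-specific fine-tuning in the supervised setting, textual inversion in the zero-shot one), but the query structure, one reference image plus one relative caption, is the same.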
17 February 2025
Italian
artificial intelligence
computer vision
fashion image generation
image retrieval
virtual try-on
vision-language models
Bertini, Marco
Bagdanov, Andrew David
Files in this item:

File: AlbertoBaldratiPhDThesisFinalPdfa.pdf
Access: open access
License: All rights reserved
Size: 110.9 MB
Format: Adobe PDF

File: Final_Report_compiled.pdf
Access: not available
License: All rights reserved
Size: 583.79 kB
Format: Adobe PDF

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/216376
The NBN code of this thesis is URN:NBN:IT:UNIPI-216376