From Retrieval to Generation: Multimodal Models for Vision and Language Tasks
BALDRATI, ALBERTO
2025
Abstract
This thesis explores the integration of visual and textual data across various tasks, from image retrieval to image generation. To address discriminative and generative tasks, we examine two types of multimodal models: Vision-Language Models (VLMs) for retrieval and classification, and text-to-image diffusion models for generative problems. We begin by tackling the task of supervised Composed Image Retrieval (CIR), where the goal is to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. In this context, we propose a two-stage approach that adapts the CLIP model to CIR, achieving state-of-the-art results on the benchmark datasets FashionIQ and CIRR. Building on this, we then introduce a new task called Zero-Shot CIR (ZS-CIR), which aims to address CIR without requiring a labeled training dataset. The proposed method, iSEARLE (improved zeroShot composEd imAge Retrieval with textuaL invErsion), sets a new standard for ZS-CIR. Additionally, we support further research in this area by introducing the CIRCO dataset, the first CIR dataset to include multiple ground-truth images for each query, enabling a more comprehensive evaluation. Following this, we explore the few-shot adaptation of VLMs by introducing KDPL (Knowledge Distillation Prompt Learning), a parameter-efficient, unsupervised prompt learning method based on knowledge distillation. KDPL can be seamlessly integrated into existing prompt learning techniques, with experiments demonstrating significant improvements in zero-shot generalization across more than ten benchmark datasets. Next, the focus shifts to generative tasks in the fashion domain. We propose a latent diffusion model for multimodal fashion image editing, capable of generating realistic, human-centric fashion images from diverse inputs such as text descriptions, body poses, sketches, and fabric textures. 
To address the lack of appropriate datasets, we extend the Dress Code and VITON-HD datasets with multimodal annotations. Finally, we further explore generative models within the fashion domain by tackling the virtual try-on task. We introduce LaDI-VTON, the first diffusion-based model for this task, which employs textual inversion to preserve garment texture details during image generation, offering significant improvements over existing methods.
| File | Access | License | Size | Format | |
|---|---|---|---|---|---|
| AlbertoBaldratiPhDThesisFinalPdfa.pdf | open access | All rights reserved | 110.9 MB | Adobe PDF | View/Open |
| Final_Report_compiled.pdf | not available | All rights reserved | 583.79 kB | Adobe PDF | |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/216376
URN:NBN:IT:UNIPI-216376