Multimodal Understanding through Retrieval-Augmentation: from Models to Evaluation
SARTO, SARA
2026
Abstract
In Artificial Intelligence, the introduction of the attention mechanism and the Transformer architecture has enabled models capable of processing multiple modalities at unprecedented scale. This breakthrough is largely due to the flexibility of the attention operator and the adaptability of the architecture, which have given rise to a new generation of vision-language systems. Among the tasks at the intersection of Computer Vision, Natural Language Processing, and Multimedia, image captioning (i.e., the task of generating natural language descriptions of visual content) has played a pivotal role. In the era of modern Multimodal Large Language Models (MLLMs), captioning remains a fundamental component, now coexisting with multimodal tasks such as Visual Question Answering (VQA). To further enhance the capabilities of such models, retrieval augmentation has emerged as a key strategy. Enriching models with relevant external knowledge improves factual grounding and adaptability, enabling more accurate and context-aware responses, particularly in knowledge-intensive or domain-specific scenarios. This thesis represents the natural evolution of retrieval augmentation, moving from its early application in image captioning to its integration within modern MLLMs. Each stage builds upon the insights and challenges encountered along the way, addressing open problems in evaluation and retrieval effectiveness.

The first part of the thesis establishes the foundations of retrieval-augmented vision-language models. It analyzes classical cross-modal retrieval and extends beyond standard settings to address more complex scenarios, including multimodal queries and heterogeneous documents. A central insight is that retrieval quality critically determines downstream performance, particularly as multimodal applications increasingly involve both multimodal queries and document collections. In response, this work presents new retrievers (ReT and ReT-2) tailored for multimodal scenarios.

Building on this, the thesis investigates retrieval-augmented architectures for captioning through the introduction of the RA-Transformer, in which external knowledge is integrated into the caption generation process, providing cues for richer and more precise descriptions. The thesis then extends retrieval augmentation to MLLMs, motivated by the fact that even large-scale pretraining struggles with domain-specific or knowledge-intensive queries. Specifically, WikiLLaVA introduces retrieval-augmented MLLM architectures for knowledge-based VQA, where retrieval mechanisms are used to enhance reasoning capabilities and adaptability to complex multimodal queries. Throughout this research, it becomes evident that the advancement of captioning models is constrained by the lack of robust and reliable evaluation metrics. Traditional metrics, while widely adopted, often fail to capture semantic adequacy, factual grounding, and fluency. To address this, a major contribution of this thesis is the design and analysis of new evaluation metrics for image captioning, namely PAC-S, BRIDGE, and an improved version of PAC-S. These metrics are specifically designed to align with human judgment and capture the multifaceted quality of captions. Beyond their introduction, the thesis investigates their application across different benchmarks and domains, including their ability to evaluate captions generated by MLLMs, reflecting the shift of captioning from a standalone task to an integral component of broader multimodal reasoning systems.

Overall, through novel retrieval-augmented captioning architectures, improved evaluation metrics, and specialized multimodal retrievers, this thesis contributes new methodologies, tools, and insights that advance the field of multimodal AI.

| File | Size | Format | |
|---|---|---|---|
| Sarto.pdf (open access; License: All rights reserved) | 9.51 MB | Adobe PDF | View/Open |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/362899
URN:NBN:IT:UNIMORE-362899