Learning without Labels - Reducing Supervision in Training, Inference, and Evaluation of Deep Neural Networks

Conti, Alessandro
2025

Abstract

This thesis investigates how the reliance on supervision can be reduced across the entire deep learning pipeline. In the training phase, we explore unsupervised fine-tuning, focusing on Source-Free Unsupervised Domain Adaptation scenarios in visual tasks such as Facial Expression Recognition and video-based Action Recognition, primarily leveraging self-supervision and self-training. At inference, we address the challenge of removing fixed output vocabularies from Vision Language Models by formalizing the tasks of Vocabulary-free Image Classification and Vocabulary-free Semantic Segmentation and by introducing a family of efficient methods that adapt CLIP to these tasks. We also evaluate Large Multimodal Models under a similarly constrained scenario, analyzing their predictions, categorizing their mistakes, and proposing tailored solutions to optimize their performance. Finally, we investigate unsupervised evaluation by proposing a framework that uses a Large Language Model and modular tools to automatically generate, execute, and interpret evaluation experiments for Large Multimodal Models without ground-truth labels. By reducing the need for human supervision at every stage of the deep learning pipeline, this thesis contributes toward a more flexible and efficient paradigm for developing and deploying deep neural networks in real-world, data-scarce, and open-ended settings.
17 July 2025
English
Ricci, Elisa
Rota, Paolo
Università degli studi di Trento
TRENTO
195
Files in this record:
output.pdf (open access, 8.06 MB, Adobe PDF)
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/218077
The NBN code of this thesis is URN:NBN:IT:UNITN-218077