Capitalizing on self-supervision and pre-trained models in computer vision

Rahman, Muhammad Rameez Ur

This thesis addresses the challenge of advancing computer vision tasks when labeled data is limited, by capitalizing on the knowledge already encoded in pre-trained models. Across three distinct computer vision tasks (classification, regression, and segmentation), it presents frameworks aimed at overcoming the constraints imposed by data scarcity and task-specific methodologies.

The first focus is Unsupervised Domain Adaptation (UDA) in visual recognition, which is critical for bridging disparate visual domains and achieving robust real-world performance. Existing UDA approaches typically require manual adaptation to a specific backbone architecture, so they become outdated as architectures evolve. To circumvent this limitation, this thesis proposes Adversarial Branch Architecture Search for UDA (ABAS). ABAS addresses the lack of target labels with a data-driven ensemble approach to model selection, and it searches over auxiliary adversarial branches to drive domain alignment. Extensive validation on standard visual recognition datasets shows that ABAS improves modern UDA techniques, robustly yielding superior performance across diverse domains.

For regression, the thesis studies collaborative human pose forecasting, an understudied problem in which performance can be improved by exploiting the correlated motion patterns of interacting individuals. Revisiting prevalent single-person practices and tailoring them to the collaborative setting yields significant improvements. Notably, integrating frequency input representations, space-time separable interaction encodings, and fully-learnable interaction adjacencies into a Graph Convolutional Network (GCN) framework shows promising results. Furthermore, a novel initialization procedure for the spatial interaction parameters improves both performance and stability, culminating in a substantial performance boost over state-of-the-art methods on benchmark datasets.

Lastly, the thesis tackles semantic segmentation in autonomous driving scenarios, leveraging the ability of event cameras to operate with low latency in challenging lighting conditions. We introduce OVOSE, the first open-vocabulary semantic segmentation approach explicitly tailored for event-based data. OVOSE leverages knowledge distillation from pre-trained image-based models and synthetic event data to enhance segmentation performance. Additionally, we propose a novel dissimilarity network that recalibrates the mask loss, mitigating the effects of sub-optimal reconstructions and enabling precise fine-tuning of the segmentation model. With this approach, OVOSE demonstrates superior performance in dynamic environments, outperforming existing conventional image-based models and state-of-the-art methods in unsupervised domain adaptation for event-based semantic segmentation.

In summary, this thesis presents a holistic approach to computer vision tasks, unifying disparate methodologies under the common goal of leveraging pre-trained models and limited labels to achieve superior performance across diverse domains. By addressing specific challenges within classification, regression, and segmentation tasks, the proposed frameworks contribute towards advancing the frontier of computer vision in real-world applications.

2024



Faculty/Department: DIPARTIMENTO DI INFORMATICA
Degree programme: Informatica
Publication date: 28 May 2024
Language: English
Supervisor, Advisor, or Tutor: GALASSO, FABIO
Co-Supervisor, Second Reader, Co-Tutor, or Coordinators: MANCINI, MAURIZIO
Publisher: Università degli Studi di Roma "La Sapienza"
Number of pages: 81
Collection: Università degli Studi di Roma La Sapienza

Files in this item:

File	Size	Format
Tesi_dottorato_Rahman.pdf (Open Access from 29 May 2025; License: All rights reserved)	12.17 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/126729

The NBN code of this thesis is URN:NBN:IT:UNIROMA1-126729