Multimodal Scene Understanding: Continual Learning and Domain Adaptation in Dynamic Environments
RIZZOLI, GIULIA
2025
Abstract
This thesis addresses critical challenges in multimodal scene understanding for dynamic, real-world environments. Such environments are characterized by constant change and unpredictability, including variations in lighting, weather, scenes, and data sources, as well as the need to adapt to new contexts or update knowledge over time without losing previous learning. We first introduce DepthFormer, a transformer-based architecture that effectively integrates depth information with color data for improved semantic segmentation. Building on this, we tackle domain adaptation in multimodal scenarios with MISFIT, a depth-aware framework for source-free adaptation in vision transformers. To address the scarcity of annotated data in aerial imagery, we develop Syndrone, a comprehensive synthetic dataset enabling synthetic-to-real adaptation for UAV vision systems. We further explore decentralized training via federated learning in adverse conditions with HyperFLAW, which accommodates heterogeneous agents and introduces weather-aware techniques for semantic segmentation. In the context of continual learning, we present RECALL+, which leverages web-crawled data to mitigate catastrophic forgetting across multiple incremental steps. Finally, we introduce Web-WILSS, a framework for weakly-supervised incremental learning in semantic segmentation using widely available web images.
File: rizzoli_tesi (1).pdf (open access), 57.88 MB, Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/210027
URN:NBN:IT:UNIPD-210027