Multimodal Scene Understanding: Continual Learning and Domain Adaptation in Dynamic Environments

RIZZOLI, GIULIA
2025

Abstract

This thesis addresses critical challenges in multimodal scene understanding for dynamic, real-world environments. These environments are characterized by constant changes and unpredictability, including variations in lighting, weather, scenes, and data sources, as well as the need to adapt to new contexts or update knowledge over time without losing previous learning. We first introduce DepthFormer, a transformer-based architecture that effectively integrates depth information with color data for improved semantic segmentation. Building on this, we tackle the problem of domain adaptation in multimodal scenarios with MISFIT, a depth-aware framework for source-free adaptation in vision transformers. To address the scarcity of annotated data in aerial imagery, we develop Syndrone, a comprehensive synthetic dataset enabling synthetic-to-real adaptation for UAV vision systems. We further explore decentralized training -- federated learning -- in adverse conditions with HyperFLAW, accommodating heterogeneous agents and introducing weather-aware techniques for semantic segmentation. In the context of continual learning, we present RECALL+, which leverages web-crawled data to mitigate catastrophic forgetting across multiple incremental steps. Finally, we introduce Web-WILSS, a framework for weakly-supervised incremental learning in semantic segmentation using widely available web images.
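For illustration only, the sketch below shows the kind of RGB-D fusion the abstract alludes to: color tokens attending to depth tokens inside a transformer block. It is a generic PyTorch example under assumed names, shapes, and hyperparameters, not the thesis's DepthFormer architecture.

# Illustrative sketch only: generic RGB-D token fusion via cross-attention.
# NOT the DepthFormer architecture; all names and dimensions are assumptions.
import torch
import torch.nn as nn


class RGBDFusionBlock(nn.Module):
    """Fuses color and depth token sequences with cross-attention,
    a common pattern in multimodal transformer segmentation backbones."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)
        # RGB tokens are the queries; depth tokens supply keys and values.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, rgb_tokens: torch.Tensor, depth_tokens: torch.Tensor) -> torch.Tensor:
        q = self.norm_rgb(rgb_tokens)
        kv = self.norm_depth(depth_tokens)
        fused, _ = self.cross_attn(q, kv, kv)
        x = rgb_tokens + fused          # residual connection around attention
        return x + self.mlp(x)          # feed-forward refinement


if __name__ == "__main__":
    # Hypothetical usage: 1 image, 1024 patch tokens per modality, 256-dim embeddings.
    rgb = torch.randn(1, 1024, 256)
    depth = torch.randn(1, 1024, 256)
    print(RGBDFusionBlock()(rgb, depth).shape)  # torch.Size([1, 1024, 256])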
Date of defence: 24 March 2025
Language: English
Supervisor: Zanuttigh, Pietro
University: Università degli studi di Padova
Files in this item:
rizzoli_tesi (1).pdf (open access, 57.88 MB, Adobe PDF)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/210027
The NBN code of this thesis is URN:NBN:IT:UNIPD-210027