Multimodal Scene Understanding: Continual Learning and Domain Adaptation in Dynamic Environments
RIZZOLI, GIULIA
2025
Abstract
This thesis addresses critical challenges in multimodal scene understanding for dynamic, real-world environments. Such environments are characterized by constant change and unpredictability, including variations in lighting, weather, scenes, and data sources, as well as the need to adapt to new contexts or update knowledge over time without losing previous learning. We first introduce DepthFormer, a transformer-based architecture that effectively integrates depth information with color data for improved semantic segmentation. Building on this, we tackle domain adaptation in multimodal scenarios with MISFIT, a depth-aware framework for source-free adaptation in vision transformers. To address the scarcity of annotated data in aerial imagery, we develop Syndrone, a comprehensive synthetic dataset enabling synthetic-to-real adaptation for UAV vision systems. We further explore decentralized training via federated learning in adverse conditions with HyperFLAW, which accommodates heterogeneous agents and introduces weather-aware techniques for semantic segmentation. In the context of continual learning, we present RECALL+, which leverages web-crawled data to mitigate catastrophic forgetting across multiple incremental steps. Finally, we introduce Web-WILSS, a framework for weakly-supervised incremental learning in semantic segmentation using widely available web images.
File: rizzoli_tesi (1).pdf (open access), 57.88 MB, Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/210027
URN:NBN:IT:UNIPD-210027