Real-Time Monocular Scene Analysis for UAV in Outdoor Environments
ABDELMOTTALEB, YARA ALAAELDIN ABDELAZIZ
2026
Abstract
Understanding the geometric and semantic properties of a scene is crucial for autonomous navigation and particularly challenging for Unmanned Aerial Vehicles (UAVs). Such information may be obtained by estimating depth and semantic segmentation maps of the surrounding environment, and for practical deployment the procedure must run as fast as possible. In this thesis, we leverage monocular cameras on aerial robots to predict depth and semantic maps in low-altitude unstructured environments. We propose a joint deep-learning architecture, named Co-SemDepth, that performs the two tasks accurately and rapidly, and we validate its effectiveness on a variety of datasets. Training neural networks requires an abundance of annotated data, and in the UAV field such data is scarce due to the specificity of the domain and the burden of the annotation process. Simulation engines allow annotated data to be collected automatically with minimal effort. We introduce a new synthetic dataset, TopAir (publicly available at https://huggingface.co/datasets/yaraalaa0/TopAir), containing images captured with a nadir view in outdoor environments at different altitudes, helping to fill the gap left by the scarcity of annotated datasets in the aerial field. While using synthetic data for training is convenient, it raises issues when shifting to the real domain at test time. We conduct an extensive analytical study to assess the effect of several factors on synthetic-to-real generalization in depth estimation and semantic segmentation, comparing the Co-SemDepth and TaskPrompter models. The results reveal superior generalization performance for Co-SemDepth in depth estimation and for TaskPrompter in semantic segmentation. Our analysis also identifies which training datasets lead to better generalization for each task.
Using few-shot learning generally improves the generalization outcomes, and we present a visualization of 3D semantic maps built from the predictions. Moreover, to help attenuate the gap between the synthetic and real domains, we explore image style transfer techniques on aerial images to convert from a synthetic style to a realistic one, employing CycleGAN and diffusion models; the results show that diffusion models perform better at synthetic-to-real style transfer. Finally, we focus on the marine domain and address its challenges. Co-SemDepth is trained on a collected synthetic marine dataset, called MidSea, and tested on both synthetic and real data. In addition, self-supervised approaches are explored to improve the results and cope with the limited annotated data available. Co-SemDepth trained from scratch generalizes well to real data from the SMD dataset, which contains simple marine scenarios, while further improvement is needed on the MIT Sea Grant dataset, which contains more challenging scenarios.
| File | Size | Format | |
|---|---|---|---|
| phdunige_5750494.pdf (open access; license: all rights reserved) | 52.44 MB | Adobe PDF | View/Open |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/359756
URN:NBN:IT:UNIGE-359756