
Learning how to perceive the real world from simulations

2021

Abstract

Deep learning has brought outstanding results in many tasks that once seemed infeasible, such as those requiring semantic understanding and human knowledge. Convolutional Neural Networks (CNNs), in particular, have been successful in computer vision because they closely mimic human visual processing. In principle they can be trained for any task, depending on how their output layer is designed, but they require data to train on. The most common setup is supervised learning, where the training data consists of a set of inputs and their corresponding outputs. Yet any neural network eventually has to be deployed in the real world, where the input data differs, even if slightly, from what was seen during training. Modern deep models require large and heterogeneous datasets to reach their full potential, but gathering such data is expensive. Simulators offer an attractive solution to the labeling problem, since they can automatically generate both data and ground truth. However, despite the photo-realism of modern computer graphics, synthetic images still lack many characteristics of real-world ones. As a consequence, CNNs trained on synthetic images perform poorly when tested on real ones. When circumstances or particular conditions change the distribution of the data, we refer to them as domains, e.g. sunny and rainy, or summer and winter. The change between the distributions of two domains, and consequently between the features extracted from them, is called domain shift. To cope with the problems that domain shift causes for neural networks, one has to solve a domain adaptation problem. In this work we consider the particular case of Unsupervised Domain Adaptation (UDA), where no labels are available for the domain used at test time. This work aims to develop perception methods that can operate in real environments without requiring any labeled real image.

First, we present a complete system for UDA of Semantic Segmentation. Our system combines pixel-level and feature-level alignment, which cooperate through bidirectional learning to improve each other. In particular, we develop a Semantically Adaptive Image-to-image Translation method, which exploits the task network to adaptively align each class between the two input domains. We use it to translate synthetic images to the real domain, and the translated images are then used to train the segmentation model. While training the segmentation model, we also perform feature-level alignment through adversarial learning and self-supervision. We extensively test our system on common datasets and architectures and show that our results surpass the state of the art in UDA for Semantic Segmentation.

We then extend our work to a more complex task, Panoptic Segmentation. We develop a method that adapts each subtask separately by training each branch in an adversarial fashion. In particular, we align the features extracted by each branch of the task network through a dedicated discriminator, which also enables us to align the final panoptic output. We experiment with this method on a synthetic-to-real setting of our choice and perform an ablation study to investigate the best solution for Panoptic Domain Adaptation.
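As a concrete illustration of the adversarial feature-level alignment the abstract describes, the following is a minimal PyTorch sketch of one training step that combines a supervised segmentation loss on (translated) source images with an adversarial term that pushes target-domain predictions toward the source distribution. All names here (OutputDiscriminator, uda_train_step, seg_net, lambda_adv) are illustrative assumptions, not the thesis' actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputDiscriminator(nn.Module):
    """Fully convolutional discriminator that classifies per-pixel
    segmentation predictions as coming from the source or target domain.
    (Hypothetical architecture, not taken from the thesis.)"""
    def __init__(self, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_classes, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 4, stride=2, padding=1),  # per-patch domain logits
        )

    def forward(self, x):
        return self.net(x)

def uda_train_step(seg_net, disc, opt_seg, opt_disc,
                   src_img, src_lbl, tgt_img, lambda_adv=0.001):
    """One UDA step: supervised loss on source, adversarial alignment on target."""
    bce = nn.BCEWithLogitsLoss()

    # 1) Supervised segmentation on source images (ideally images already
    #    translated to the real domain by the image-to-image translation stage).
    src_logits = seg_net(src_img)                        # B x C x H x W
    loss_seg = F.cross_entropy(src_logits, src_lbl, ignore_index=255)

    # 2) Adversarial alignment: the segmenter tries to make target-domain
    #    predictions indistinguishable from source-domain ones.
    tgt_logits = seg_net(tgt_img)
    d_on_tgt = disc(F.softmax(tgt_logits, dim=1))
    loss_adv = bce(d_on_tgt, torch.ones_like(d_on_tgt))  # fool the discriminator

    opt_seg.zero_grad()
    (loss_seg + lambda_adv * loss_adv).backward()
    opt_seg.step()

    # 3) Discriminator update: source predictions -> 1, target predictions -> 0.
    d_src = disc(F.softmax(src_logits.detach(), dim=1))
    d_tgt = disc(F.softmax(tgt_logits.detach(), dim=1))
    loss_disc = 0.5 * (bce(d_src, torch.ones_like(d_src)) +
                       bce(d_tgt, torch.zeros_like(d_tgt)))

    opt_disc.zero_grad()
    loss_disc.backward()
    opt_disc.step()

    return loss_seg.item(), loss_adv.item(), loss_disc.item()

Under the same assumptions, the self-supervision mentioned in the abstract would add a cross-entropy term on confident target pseudo-labels, and the panoptic extension would attach one such discriminator to each branch of the task network.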
Language: English
Keywords: computer vision; deep learning; domain adaptation; semantic segmentation; panoptic segmentation
Supervisor: Bertozzi, Massimo
University: Università degli Studi di Parma
Files in this record:

File              Type                     Size      Format     Access
final_report.pdf  Other attached material  5.5 kB    Adobe PDF  BNCF and BNCR only
PhDThesis.pdf     Other attached material  95.36 MB  Adobe PDF  BNCF and BNCR only

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/149785
The NBN code of this thesis is URN:NBN:IT:UNIPR-149785