Audio-Visual Learning for Scene Understanding

SANGUINETI, VALENTINA
2022

Abstract

Multimodal deep learning aims to combine the complementary information carried by different modalities. Among all modalities, audio and video are the predominant ones that humans use to explore the world. In this thesis, we therefore focus on audio-visual deep learning, so that our networks mimic how humans perceive the world. Our research involves images, audio signals, and acoustic images. The latter provide spatial audio information and are obtained from a planar array of microphones, whose raw signals are combined by a beamforming algorithm. Acoustic images mimic the human auditory system more closely than a single microphone, which alone cannot provide spatial sound cues. However, since microphone arrays are not widespread, we also study how to handle the missing spatialized audio modality at test time. As a solution, we propose to distill the content of acoustic images into audio features during training, so that their absence at test time can be handled. We do this for supervised audio classification using the generalized distillation framework, which we also extend to self-supervised learning. Next, we devise a method for reconstructing acoustic images from a single microphone and an RGB frame. Thus, when only a standard video is available, we can synthesize spatial audio, which is useful for many audio-visual tasks, including sound localization. Lastly, as another example of restoring one modality from the available ones, we inpaint degraded images using audio features, reconstructing the missing region so that it is not only visually plausible but also semantically consistent with the accompanying sound. This also covers cross-modal generation in the limit case where the visual modality is completely missing or hidden: our method deals with it naturally, being able to generate images from sound alone. In summary, we show how audio can help visual learning and vice versa, by transferring knowledge between the two modalities at training time in order to distill, reconstruct, or restore the missing modality at test time.
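
As a rough illustration of how an acoustic image can be formed from a planar microphone array, the sketch below implements simple far-field delay-and-sum beamforming over a grid of steering directions. It is not the exact pipeline used in the thesis: the array geometry, sampling rate, steering grid, and integer-sample alignment are assumptions made only for this example.

```python
import numpy as np

def delay_and_sum_acoustic_image(signals, mic_xy, fs, directions, c=343.0):
    """Delay-and-sum beamforming over a grid of steering directions.

    signals:    (n_mics, n_samples) raw waveforms from a planar array
    mic_xy:     (n_mics, 2) microphone positions on the array plane [m]
    fs:         sampling rate [Hz]
    directions: (H, W, 2) per-pixel steering direction components (dx, dy)
    Returns an (H, W) map of beamformed energy (one acoustic image).
    """
    n_mics, n_samples = signals.shape
    H, W, _ = directions.shape
    image = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            d = directions[i, j]                       # steering direction
            delays = mic_xy @ d / c                    # per-mic delays [s]
            shifts = np.round(delays * fs).astype(int)
            # align the microphone signals for this direction and average them
            beam = np.zeros(n_samples)
            for m in range(n_mics):
                beam += np.roll(signals[m], -shifts[m])
            beam /= n_mics
            image[i, j] = np.mean(beam ** 2)           # energy at this pixel
    return image
```

In practice, beamforming is usually performed per frequency band and with sub-sample interpolation; this integer-shift version only conveys the idea of turning multi-microphone audio into a spatial energy map aligned with the camera view.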
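The supervised distillation step mentioned above can be pictured with the standard generalized-distillation objective: a teacher network sees the acoustic images (the privileged modality), while the audio-only student is trained on a mix of ground-truth labels and the teacher's softened predictions. This is a minimal sketch of that objective; the temperature, weighting, and network outputs below are placeholders, not the settings actually used in the thesis.

```python
import torch
import torch.nn.functional as F

def generalized_distillation_loss(student_logits, teacher_logits, labels,
                                  temperature=2.0, lam=0.5):
    """Mix hard-label cross-entropy with softened teacher targets.

    student_logits: (B, C) predictions of the audio-only student
    teacher_logits: (B, C) predictions of the acoustic-image teacher
    labels:         (B,)   ground-truth class indices
    """
    # standard supervised term on the hard labels
    hard = F.cross_entropy(student_logits, labels)
    # imitation term: KL divergence between softened teacher and student
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)
    soft = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")
    soft = soft * temperature ** 2      # usual scaling of the soft term
    return (1.0 - lam) * hard + lam * soft
```

The weight lam trades off imitating the privileged teacher against fitting the hard labels; at test time only the audio student is needed, which is how the missing acoustic-image modality is handled.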
25-feb-2022
English
MURINO, VITTORIO
DEL BUE, ALESSIO
MORERIO, PIETRO
VALLE, MAURIZIO
Università degli studi di Genova
Files in this record:

File                     Size       Format     Access
phdunige_3950432_1.pdf   1.1 MB     Adobe PDF  open access
phdunige_3950432_2.pdf   13.84 MB   Adobe PDF  open access
phdunige_3950432_3.pdf   13.38 MB   Adobe PDF  open access
phdunige_3950432_4.pdf   16.51 MB   Adobe PDF  open access
phdunige_3950432_5.pdf   22.87 MB   Adobe PDF  open access
phdunige_3950432_6.pdf   15.66 MB   Adobe PDF  open access

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/68511
The NBN code of this thesis is URN:NBN:IT:UNIGE-68511