Audio-Visual Learning for Scene Understanding

SANGUINETI, VALENTINA
2022

Abstract

Multimodal deep learning aims to combine the complementary information carried by different modalities. Among these, audio and video are the predominant ones humans use to explore the world. In this thesis, we focus on audio-visual deep learning, designing networks that mimic how humans perceive their surroundings. Our research involves images, audio signals, and acoustic images. The latter provide spatial audio information and are obtained from a planar microphone array whose raw signals are combined by a beamforming algorithm. They approximate the human auditory system more closely than a single microphone, which alone cannot provide spatial sound cues. However, since microphone arrays are not widespread, we also study how to cope with the spatialized audio modality being missing at test time. As a solution, we propose to distill the content of acoustic images into audio features during training, so that the model can operate without them at test time. We do this for supervised audio classification using the generalized distillation framework, which we also extend to self-supervised learning. Next, we devise a method for reconstructing acoustic images given a single microphone and an RGB frame: when only a standard video is available, we can thus synthesize spatial audio, which is useful for many audio-visual tasks, including sound localization. Lastly, as another example of restoring one modality from the available ones, we inpaint degraded images using audio features, reconstructing the missing region so that it is not only visually plausible but also semantically consistent with the associated sound. This also covers cross-modal generation in the limit case of a completely missing or hidden visual modality: our method naturally handles it, generating images from sound alone. In summary, we show how audio can help visual learning and vice versa, by transferring knowledge between the two modalities at training time in order to distill, reconstruct, or restore the missing modality at test time.
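The abstract does not specify which beamforming algorithm turns the planar array's raw signals into an acoustic image, so the sketch below is only a rough illustration of the general idea: frequency-domain delay-and-sum beamforming evaluated over a grid of look directions, producing a 2D map of acoustic energy. The function name, arguments, and the far-field assumption are illustrative choices, not details taken from the thesis.

```python
import numpy as np

def acoustic_image(signals, mic_xy, directions, fs, c=343.0):
    """Illustrative delay-and-sum beamformer over a grid of look directions.

    signals    : (M, T) raw waveforms from the M microphones of a planar array
    mic_xy     : (M, 2) microphone positions on the array plane, in metres
    directions : (H, W, 3) unit look-direction vectors, one per image pixel
    fs         : sampling rate in Hz
    c          : speed of sound in m/s
    Returns an (H, W) map of beamformed acoustic energy (the "acoustic image").
    """
    M, T = signals.shape
    H, W, _ = directions.shape
    spectra = np.fft.rfft(signals, axis=1)          # (M, F) per-channel spectra
    freqs = np.fft.rfftfreq(T, d=1.0 / fs)          # (F,) frequency bins
    image = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            d = directions[i, j]
            # Far-field delays: projection of mic positions onto the look direction.
            tau = (mic_xy @ d[:2]) / c              # (M,) per-microphone delay
            # Phase-align each channel, then sum (delay-and-sum in frequency domain).
            steered = spectra * np.exp(2j * np.pi * freqs[None, :] * tau[:, None])
            beam = steered.sum(axis=0)
            image[i, j] = np.mean(np.abs(beam) ** 2)
    return image
```

In practice the grid of look directions is chosen to match the camera's field of view, so that each pixel of the acoustic image corresponds to a pixel region of the RGB frame; per-frequency acoustic images can also be kept separate instead of averaged, as suggested by the multichannel spectra above.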
25 Feb 2022
English
MURINO, VITTORIO
DEL BUE, ALESSIO
MORERIO, PIETRO
VALLE, MAURIZIO
Università degli studi di Genova
Files in this record:
phdunige_3950432_1.pdf (open access, Adobe PDF, 1.1 MB)
phdunige_3950432_2.pdf (open access, Adobe PDF, 13.84 MB)
phdunige_3950432_3.pdf (open access, Adobe PDF, 13.38 MB)
phdunige_3950432_4.pdf (open access, Adobe PDF, 16.51 MB)
phdunige_3950432_5.pdf (open access, Adobe PDF, 22.87 MB)
phdunige_3950432_6.pdf (open access, Adobe PDF, 15.66 MB)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/125741
The NBN code of this thesis is URN:NBN:IT:UNIGE-125741