An Echo of Sight: Generative Models for Audio-Visual Spatial Coherence

Author: MARINONI, CHRISTIAN
Year: 2026

Abstract

While deep generative models have achieved remarkable success in synthesizing high-fidelity images, video, and audio, the next frontier is coherent multimodal synthesis. In the audio-visual domain, current research has largely focused on semantic ("what") and temporal ("when") alignment, while neglecting the equally critical dimension of spatial coherence ("where"). This omission creates a perceptual disconnect that breaks immersion, as the auditory world feels flat and detached from the visual space. This thesis, guided by the principle of "An Echo of Sight", directly addresses this gap by investigating, developing, and advancing generative models that establish robust spatial coherence between audio and visual modalities. The work progresses through four core, interconnected contributions. First, we establish an analytical foundation for spatial audio understanding. Through our work on the L3DAS23 challenge and dataset, we demonstrate that deep learning models can effectively extract and exploit the rich spatial cues embedded in 3D Ambisonics audio for complex analysis tasks, including 3D speech enhancement and sound event localization and detection. Second, we transition from analysis to perceptual synthesis with StereoSync, a novel framework for spatially-aware video-to-audio (V2A) generation. This model is the first to leverage visual spatial cues, such as depth maps and object trajectories, to condition a latent diffusion model, successfully generating stereo audio that spatially pans and aligns with on-screen object dynamics. Third, we address the "off-screen" problem by expanding the generative context to a full spherical environment. We introduce Con360-AV, a framework for joint audio-visual generation conditioned on a complete 360° space. By using panoramic saliency and novel geometric maps, the model generates specific audio-visual viewpoints that are coherently embedded within the larger, surrounding world. Finally, we introduce the HA30K dataset, a large-scale collection of acoustic simulations, and develop a generative surrogate model that learns to approximate the solutions of the Helmholtz equation. This work demonstrates that a generative model can learn the complex physical laws connecting a visual "Sight" (the geometry of a space) to its physical "Echo" (the acoustic pressure field). The proposed frameworks demonstrate significant quantitative improvements in spatial alignment, generative fidelity, and computational efficiency. In particular, StereoSync achieves state-of-the-art spatial tracking, Con360-AV demonstrates robust spatial control in a 360° context, and our physics-based surrogate achieves a nearly 5x speedup over traditional solvers in batch processing. Collectively, this research provides a comprehensive methodology for audio-visual spatial coherence and delivers foundational technologies for the next generation of immersive media, virtual reality, and engineering-focused "Acoustic Digital Twins".
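
For context on the final contribution, the Helmholtz equation mentioned above is the frequency-domain form of the acoustic wave equation. A standard textbook statement (the notation below is a generic form given for illustration, not necessarily the convention adopted in the thesis) is:

\[
\nabla^{2} p(\mathbf{x}) + k^{2}\, p(\mathbf{x}) = 0, \qquad k = \frac{2\pi f}{c},
\]

where p is the complex acoustic pressure field, k the wavenumber, f the frequency, and c the speed of sound. The surrogate described in the abstract learns to map a room geometry to an approximation of this pressure field, which is what allows it to bypass traditional numerical solvers and obtain the reported batch-processing speedup.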
Date: 27 January 2026
Language: English
Supervisor: COMMINIELLO, DANILO
Institution: Università degli Studi di Roma "La Sapienza"
Files in this record:
Tesi_dottorato_Marinoni.pdf (Adobe PDF, 34.97 MB), open access, Creative Commons license

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/357512
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-357512