An Echo of Sight: Generative Models for Audio-Visual Spatial Coherence

Author: MARINONI, CHRISTIAN
Year: 2026

Abstract

While deep generative models have achieved remarkable success in synthesizing high-fidelity images, video, and audio, the next frontier is coherent multimodal synthesis. In the audio-visual domain, current research has largely focused on semantic ("what") and temporal ("when") alignment, while neglecting the equally critical dimension of spatial coherence ("where"). This omission creates a perceptual disconnect that breaks immersion, as the auditory world feels flat and detached from the visual space. This thesis, guided by the principle of "An Echo of Sight", directly addresses this gap by investigating, developing, and advancing generative models that establish robust spatial coherence between audio and visual modalities. The work progresses through four core, interconnected contributions. First, we establish an analytical foundation for spatial audio understanding. Through our work on the L3DAS23 challenge and dataset, we demonstrate that deep learning models can effectively extract and exploit the rich spatial cues embedded in 3D Ambisonics audio for complex analysis tasks, including 3D speech enhancement and sound event localization and detection. Second, we transition from analysis to perceptual synthesis with StereoSync, a novel framework for spatially-aware video-to-audio (V2A) generation. This model is the first to leverage visual spatial cues, such as depth maps and object trajectories, to condition a latent diffusion model, successfully generating stereo audio that spatially pans and aligns with on-screen object dynamics. Third, we address the "off-screen" problem by expanding the generative context to a full spherical environment. We introduce Con360-AV, a framework for joint audio-visual generation conditioned on a complete 360° space. By using panoramic saliency and novel geometric maps, the model generates specific audio-visual viewpoints that are coherently embedded within the larger, surrounding world. Finally, we introduce the HA30K dataset, a large-scale collection of acoustic simulations, and develop a generative surrogate model that learns to approximate the solutions of the Helmholtz equation. This work demonstrates that a generative model can learn the complex physical laws connecting a visual "Sight" (the geometry of a space) to its physical "Echo" (the acoustic pressure field). The proposed frameworks demonstrate significant quantitative improvements in spatial alignment, generative fidelity, and computational efficiency. In particular, StereoSync achieves state-of-the-art spatial tracking, Con360-AV demonstrates robust spatial control in a 360° context, and our physics-based surrogate achieves a nearly 5x speedup over traditional solvers in batch processing. Collectively, this research provides a comprehensive methodology for audio-visual spatial coherence and delivers foundational technologies for the next generation of immersive media, virtual reality, and engineering-focused "Acoustic Digital Twins".
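
For context on the final contribution, the Helmholtz equation mentioned above is the frequency-domain form of the acoustic wave equation. A standard textbook statement (the notation below is a generic form given for illustration, not necessarily the convention adopted in the thesis) is:

\[
\nabla^{2} p(\mathbf{x}) + k^{2}\, p(\mathbf{x}) = 0, \qquad k = \frac{2\pi f}{c},
\]

where p is the complex acoustic pressure field, k the wavenumber, f the frequency, and c the speed of sound. The surrogate described in the abstract learns to map a room geometry to an approximation of this pressure field, which is what allows it to bypass traditional numerical solvers and obtain the reported batch-processing speedup.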
Date: 27 January 2026
Language: English
Supervisor: COMMINIELLO, DANILO
Institution: Università degli Studi di Roma "La Sapienza"
Files in this record:
Tesi_dottorato_Marinoni.pdf (Adobe PDF, 34.97 MB), open access, Creative Commons license

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/357512
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-357512