Controllable generative audio for audiovisual immersive environments
GRAMACCIONI, RICCARDO FOSCO
2026
Abstract
The rapid progress of deep generative learning has profoundly transformed multimedia production, opening new possibilities for the automatic synthesis and control of audiovisual content. Among the most promising and challenging directions is the generation of sound that is coherent with video, both in semantics and in timing, and that can adapt to the physical and acoustic characteristics of the environment in which it is reproduced. This research area has roots in machine learning, acoustics, and creative media technologies, and it is of high interest to both academia and industry. On the academic side, it raises fundamental questions on multimodal representation learning, multimodal alignment, and the physical modeling of sound. On the industrial side, companies operating in cinema, videogames, and extended-reality (XR) production are investing heavily in generative solutions that can assist sound designers, post-production engineers, and interactive content creators. Automating or augmenting sound design through learning-based methods can reduce production time, enhance creative flexibility, and enable fully adaptive audio for immersive experiences. This thesis explores controllable generative audio for realistic simulation in audiovisual and immersive environments, with the goal of learning to generate sounds that match the visual world semantically, temporally, spatially, and acoustically. The research develops through a coherent sequence of five works, each addressing a specific aspect of this problem. Starting from the synthesis of temporally synchronized Foley effects from silent video, the work evolves toward spatialized sound analysis for virtual environments and concludes with data-driven modeling of physical acoustics through deep learning (DL)-based approximations of wave equations.
The methodological backbone of this research is the use of diffusion-based generative models, which have proven highly effective in modeling the temporal and semantic dependencies between modalities. These architectures are extended with interpretable conditioning mechanisms, such as visual onset cues, motion-derived envelopes, and multimodal embeddings, enabling both automation and artistic supervision in audiovisual generation. We begin by introducing an onset-synchronized video-to-audio generation model that aligns sound with video events using visual onset detection and diffusion-based synthesis. We then improve temporal and semantic control by separating the when and the what of sound generation: we develop a model in which a video-driven motion envelope guides the timing, while semantic embeddings define the auditory content. We extend this study by introducing GRAM-aligned multimodal encoders that jointly learn coherent audio, video, and textual representations, enhancing multimodal control and semantic consistency. Our research then moves toward immersive applications, focusing on where sound sources are located in the visual space. We propose a large-scale dataset and benchmark for learning 3D spatial audio and sound source localization from multichannel Ambisonics recordings and visual data, supporting audio generation and analysis in augmented reality (AR) and virtual reality (VR) scenarios. Finally, we analyze how deep learning models can estimate physical acoustics, showing that neural networks can approximate solutions to the Helmholtz equation and emulate how waves propagate across materials and space. This last step grounds generative audio in physically consistent simulation, which is essential for realism in virtual environments. Together, these works aim to unify semantic, temporal, spatial, and physical realism in generative audiovisual learning.
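To make the "when" side of the conditioning concrete, the sketch below shows one simple way a motion-derived envelope could be extracted from video: the mean absolute frame difference, normalized and linearly resampled to the audio rate. This is an illustrative approximation under stated assumptions (grayscale frames with values in [0, 1], linear resampling); it is not the conditioning pipeline used in the thesis.

```python
import numpy as np

def motion_envelope(frames: np.ndarray, fps: float, sr: int) -> np.ndarray:
    """Toy motion-to-envelope control signal (illustrative, not the thesis model).

    frames: (T, H, W) grayscale video, values in [0, 1].
    Returns an amplitude envelope in [0, 1] sampled at audio rate `sr`.
    """
    # Per-frame motion energy: mean absolute difference between frames.
    diff = np.abs(np.diff(frames.astype(np.float64), axis=0))  # (T-1, H, W)
    energy = diff.mean(axis=(1, 2))                            # (T-1,)
    energy = np.concatenate([[0.0], energy])                   # pad back to T
    if energy.max() > 0:
        energy /= energy.max()                                 # normalize
    # Resample from video frame rate to audio sample rate.
    t_video = np.arange(len(energy)) / fps
    t_audio = np.arange(int(t_video[-1] * sr)) / sr
    return np.interp(t_audio, t_video, energy)
```

A generator conditioned this way would modulate its output so that acoustic energy peaks where visual motion peaks, independently of the semantic embedding that decides what the sound is.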
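As a worked illustration of the physical target in the last step, the snippet below verifies numerically that a plane wave satisfies the Helmholtz equation, ∇²p + k²p = 0; this residual is exactly the quantity a physics-informed network would be trained to drive to zero. The grid resolution and wavenumber here are arbitrary illustrative choices, not values from the thesis.

```python
import numpy as np

n, L = 200, 1.0
h = L / (n - 1)                        # grid spacing
x = np.linspace(0.0, L, n)
X, Y = np.meshgrid(x, x, indexing="ij")

k = 2 * np.pi * 3.0                    # wavenumber: 3 wavelengths per side
kx = ky = k / np.sqrt(2.0)             # propagation direction at 45 degrees
p = np.exp(1j * (kx * X + ky * Y))     # plane-wave pressure field

# 5-point finite-difference Laplacian on interior points.
lap = (p[2:, 1:-1] + p[:-2, 1:-1] + p[1:-1, 2:] + p[1:-1, :-2]
       - 4.0 * p[1:-1, 1:-1]) / h**2

# Helmholtz residual: vanishes up to O(h^2) discretization error.
residual = lap + k**2 * p[1:-1, 1:-1]
rel_err = np.abs(residual).max() / k**2
```

A learned surrogate replaces the analytic field `p` with a network's prediction and minimizes the same residual, which is what lets it emulate wave propagation across materials where no closed-form solution exists.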
Beyond algorithmic advances, the thesis also introduces several high-quality datasets, such as Walking the Maps, L3DAS23, and HA30K, which address the critical lack of multimodal, well-synchronized data in this field. Overall, this thesis demonstrates how diffusion-based architectures and generative models can serve as controllable and interpretable tools for realistic and creative media synthesis. The proposed frameworks and datasets pave the way for future research and industrial applications, aiming for a new generation of systems capable of producing audiovisual content that is not only perceptually coherent but also acoustically and physically realistic, an essential step for deep learning-based applications in immersive media.

| File | License | Size | Format |
|---|---|---|---|
| Tesi_dottorato_Gramaccioni.pdf (open access) | Creative Commons | 28.56 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/357570
URN:NBN:IT:UNIROMA1-357570