From source separation to compositional music generation

POSTOLACHE, EMILIAN
2024

Abstract

This thesis proposes a journey into sound processing through deep learning, particularly generative models, exploring the compositional structure of sound, which is layered into the different sources that make up the final auditory experience. The first part of the text focuses on the problem of separating sources from mixtures, initially using a deterministic separator trained with adversarial losses in a permutation-invariant manner, and then exploring the setting of Bayesian inference through latent autoregressive models. In the second half of the thesis, we focus on the continuous musical setting (as opposed to the symbolic one), where the sources that compose the sound are interdependent. By modeling this interdependence probabilistically, we develop diffusion models that allow for the compositional processing of the different stems present in tracks, thus not only separating them but also generating them in a conditioned manner (accompaniments). Subsequently, we generalize these models to text-conditioned diffusion models without requiring supervised data. We conclude the thesis by discussing possible developments in the compositional generation of audio.
29 May 2024
English
RODOLA', EMANUELE
Università degli Studi di Roma "La Sapienza"
100
Files in this record:
File: Tesi_dottorato_Postolache.pdf (open access, 4.77 MB, Adobe PDF)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/182943
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-182943