
Efficient Knowledge Transfer and Adaptation for Speech and Beyond

Cappellazzo, Umberto
2025

Abstract

This thesis advances efficient knowledge transfer and adaptation in speech processing. It addresses the limitations of transfer learning in dynamically evolving audio and speech contexts through novel approaches to class-incremental learning, parameter-efficient adaptation, and multimodal modeling. First, we provide a comprehensive framework for class-incremental spoken language understanding, allowing models to learn new intents and entities incrementally while retaining previously acquired knowledge. Using knowledge distillation and rehearsal-based strategies, we improve robustness against catastrophic forgetting, a key limitation of continual learning. We also introduce a novel approach to class-incremental audio classification based on mutual information optimization. Second, given the prohibitive computational cost of traditional model adaptation (i.e., full fine-tuning), this thesis presents a framework for parameter-efficient fine-tuning of audio and speech foundation models, and proposes new adapter designs that achieve effective transfer learning with minimal computational overhead. Finally, extending beyond unimodal settings, we propose Llama-AVSR, a new multimodal large language model with strong audio-visual speech recognition (AVSR) abilities. Llama-AVSR couples pre-trained audio and video encoders with a large language model to achieve state-of-the-art accuracy on the major AVSR benchmark, adapting the pre-trained language model to the multimodal AVSR domain while keeping its core parameters frozen to ensure computational efficiency. Overall, this thesis provides comprehensive frameworks and empirical findings that advance continual learning, efficient fine-tuning, and multimodal modeling for speech processing tasks, paving the way for adaptive, resource-efficient speech understanding systems.
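The knowledge-distillation strategy mentioned above can be illustrated with a minimal sketch: the previous model's temperature-softened outputs serve as soft targets for the updated model, penalizing drift on previously learned classes. This is a generic temperature-scaled distillation loss in plain Python; the function names and the temperature value are illustrative, not the thesis's exact formulation.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; a higher temperature softens the distribution.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between the old (teacher) and updated (student) model's
    # softened output distributions, scaled by T^2 as is conventional so the
    # gradient magnitude does not shrink with the temperature.
    p = softmax(teacher_logits, temperature)  # soft targets from the old model
    q = softmax(student_logits, temperature)  # updated model's predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return (temperature ** 2) * kl
```

When the updated model's logits match the old model's, the loss is zero; it grows as the updated model drifts away from its earlier predictions, which is what discourages catastrophic forgetting when this term is added to the task loss on new classes.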
Date: 15 January 2025
Language: English
Institution: Università degli studi di Trento, Trento
Pages: 148
Files in this record:
Doctoral_Thesis_Umberto_Cappellazzo.pdf (Adobe PDF, 5.8 MB, open access)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/188808
The NBN code of this thesis is URN:NBN:IT:UNITN-188808.