
Efficient Knowledge Transfer and Adaptation for Speech and Beyond

Cappellazzo, Umberto
2025

Abstract

This thesis advances efficient knowledge transfer and adaptation in speech processing. It addresses the limitations of transfer learning in dynamically evolving audio and speech contexts through novel approaches to class-incremental learning, parameter-efficient adaptation, and multimodal modeling. First, we provide a comprehensive framework for class-incremental spoken language understanding, allowing models to learn new intents and entities incrementally while retaining previously acquired knowledge. Using knowledge distillation and rehearsal-based strategies, we improve robustness against catastrophic forgetting, a key limitation of continual learning. We also introduce a novel approach to class-incremental audio classification based on mutual information optimization. Second, given the prohibitive computational cost of traditional model adaptation (i.e., full fine-tuning), this thesis presents a framework for parameter-efficient fine-tuning of audio and speech foundation models, and proposes new adapter designs that achieve effective transfer learning with minimal computational overhead. Finally, extending beyond unimodal settings, we propose Llama-AVSR, a new multimodal large language model with strong audio-visual speech recognition (AVSR) abilities. Llama-AVSR couples pre-trained audio and video encoders with a large language model to achieve state-of-the-art accuracy on the major AVSR benchmark, adapting the pre-trained language model to the multimodal AVSR domain while keeping its core parameters frozen to ensure computational efficiency. Overall, this thesis provides comprehensive frameworks and empirical findings that advance continual learning, efficient fine-tuning, and multimodal modeling for speech processing tasks, paving the way for adaptive, resource-efficient speech understanding systems.
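The knowledge-distillation strategy mentioned above can be illustrated with a minimal sketch: the previous model's temperature-softened outputs serve as soft targets for the updated model, penalizing drift on previously learned classes. This is a generic temperature-scaled distillation loss in plain Python; the function names and the temperature value are illustrative, not the thesis's exact formulation.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; a higher temperature softens the distribution.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL divergence between the old (teacher) and updated (student) model's
    # softened output distributions, scaled by T^2 as is conventional so the
    # gradient magnitude does not shrink with the temperature.
    p = softmax(teacher_logits, temperature)  # soft targets from the old model
    q = softmax(student_logits, temperature)  # updated model's predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return (temperature ** 2) * kl
```

When the updated model's logits match the old model's, the loss is zero; it grows as the updated model drifts away from its earlier predictions, which is what discourages catastrophic forgetting when this term is added to the task loss on new classes.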
Date: 15 January 2025
Language: English
Institution: Università degli studi di Trento, Trento
Pages: 148
Files in this record:
Doctoral_Thesis_Umberto_Cappellazzo.pdf (Adobe PDF, 5.8 MB, open access)

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/188808
The NBN code of this thesis is URN:NBN:IT:UNITN-188808.