
Towards Self-configuring Efficient Neural Architectures for Automatic Speech Recognition

Hannan, Abdul
2026

Abstract

Recent advances in neural architecture design have led to complex deep neural networks that achieve strong performance on downstream speech applications such as automatic speech recognition. The superior performance of these models comes at the cost of immense training data and time, along with significant computational resource requirements. Effectively leveraging the capabilities of such models on heterogeneous, resource-constrained edge devices poses considerable challenges. It is therefore desirable to compress these models with minimal performance degradation so that they remain operational on resource-constrained devices. Moreover, since on-device resources are shared with other running processes, the model should be able to adapt its architecture to the resources available at any given moment. This thesis proposes novel dynamic compression methods that address these limitations by allowing a model to adapt its architecture on the fly. The first contribution is a semi-dynamic training framework in which a lightweight Conformer learns encoder-level representations from a large Conformer model via knowledge distillation. In a second step, a decoder is appended to the distilled model and fine-tuned, yielding lightweight models with a 3x speed-up over existing approaches. The second contribution is a thorough exploration of inference-time layer-dropping methods, quantifying the performance of a Conformer model trained with random layer dropping at different sparsity levels. This work establishes the optimal performance-computation trade-off when encoder layers are dropped at random, and serves as a baseline for subsequent work.
The third contribution is a novel input-conditioned layer-dropping approach, applied to speech foundation models such as WavLM and the Audio Spectrogram Transformer, to instill architectural adaptability in them. This approach enables the deployment of speech foundation models on resource-constrained edge devices by reducing their computational load with minimal performance degradation. Finally, the thesis employs knowledge distillation to further improve the performance-computation trade-off by minimizing the distance between the embeddings of the full and compressed models. The proposed dynamic compression methods achieve performance comparable to the full model at 50% of its computational complexity. Moreover, the compression limit can be extended to 70% with acceptable performance, enabling the models to operate under varying and constrained resource budgets.
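The random layer dropping explored in the second contribution can be sketched as follows. This is a hypothetical illustration, not the thesis implementation: each encoder layer is skipped with probability `drop_prob`, with the skipped layer acting as the identity, so a single trained model can be run at several sparsity levels.

```python
import random

def run_encoder(layers, x, drop_prob=0.0, rng=random):
    """Apply a stack of encoder layers, skipping each one with
    probability `drop_prob` (a skipped layer acts as identity)."""
    for layer in layers:
        if rng.random() >= drop_prob:  # keep this layer
            x = layer(x)
    return x

# Toy callables standing in for Conformer blocks (hypothetical).
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]

full = run_encoder(layers, 10, drop_prob=0.0)   # all layers applied -> 19
empty = run_encoder(layers, 10, drop_prob=1.0)  # every layer skipped -> 10
```

At intermediate values of `drop_prob`, each forward pass executes a random subset of layers, which is what produces the performance-computation trade-off curve measured in this contribution.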
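A minimal sketch of the input-conditioned layer dropping idea, under assumed simplifications: a per-layer gate scores each layer's relevance for the current input, and only the top fraction of layers allowed by the compute budget is executed. The gate functions and toy layers below are hypothetical stand-ins; in practice the gates would be small learned networks.

```python
def gated_forward(layers, gates, x, keep_ratio=0.5):
    """Input-conditioned layer dropping (illustrative sketch):
    gates[i] scores layer i given the current input; only the
    top `keep_ratio` fraction of layers is executed."""
    scores = [g(x) for g in gates]                    # per-layer relevance
    n_keep = max(1, round(keep_ratio * len(layers)))  # compute budget
    keep = sorted(range(len(layers)),
                  key=lambda i: scores[i], reverse=True)[:n_keep]
    for i, layer in enumerate(layers):
        if i in keep:      # layers outside the budget act as identity
            x = layer(x)
    return x

# Hypothetical toy layers and constant gates for illustration.
layers = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3]
gates = [lambda v: 0.9, lambda v: 0.2, lambda v: 0.7]

out = gated_forward(layers, gates, 10, keep_ratio=2 / 3)  # runs layers 0 and 2
```

Because the kept subset depends on the input through the gate scores, easy inputs can traverse fewer layers than hard ones, which is the adaptability property the abstract describes.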
Date: 25 February 2026
Language: English
Supervisor: Brutti, Alessio
Institution: Università degli Studi di Trento, Trento
Pages: 66
Files in this record:
File: Phd Thesis - Abdul Hannan.pdf
Access: open access
License: all rights reserved
Size: 4.53 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/359613
The NBN code of this thesis is URN:NBN:IT:UNITN-359613