Towards Self-configuring Efficient Neural Architectures for Automatic Speech Recognition
Hannan, Abdul
2026
Abstract
Recent advances in neural architecture design have led to complex deep neural networks that exhibit promising performance on downstream speech applications such as automatic speech recognition (ASR). The superior performance of these models comes at the cost of immense training data and training time, along with significant computational resources. Effectively leveraging the capabilities of such models on varying and resource-constrained edge devices poses considerable challenges. It is therefore desirable to optimize these models, with minimal performance degradation, to render them operational on resource-constrained devices. In addition, since on-device resources are shared with other device functions, the model should be able to adapt its architecture for seamless operation within the resources available on the device. This thesis proposes novel dynamic compression methods that address these limitations by allowing a model to adapt its architecture on the fly. The first contribution is a semi-dynamic training framework in which a lightweight Conformer learns encoder-level representations from a large Conformer model via knowledge distillation. In a second step, a decoder is appended on top of the distilled model and fine-tuning is performed, yielding lightweight models with a 3x speed-up over existing approaches. The second contribution is a thorough exploration of inference-time layer-dropping methods, quantifying the performance of a Conformer model trained with random layer dropping at different sparsity levels. This work lays the foundation for achieving an optimal performance-computation trade-off when encoder layers are dropped at random, and serves as a baseline for subsequent work.
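The random layer dropping described above can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `encoder_forward`, the toy layers, and the `p_drop` parameter are illustrative assumptions, showing only the core idea that each encoder layer is skipped independently with some probability during training, while inference can run the full stack or a reduced one.

```python
import random

def encoder_forward(x, layers, p_drop=0.5, training=True, rng=None):
    """Pass x through a stack of encoder layers, skipping each layer
    independently with probability p_drop during training (a sketch of
    random layer dropping). At inference (training=False) every layer
    runs; a fixed subset could instead be kept to meet a compute budget.
    """
    rng = rng or random.Random(0)
    for layer in layers:
        if training and rng.random() < p_drop:
            continue  # skip this layer entirely (identity connection)
        x = layer(x)
    return x

# Toy usage: 12 "layers" that each add 1 to the input.
toy_layers = [(lambda v: v + 1) for _ in range(12)]
full = encoder_forward(0, toy_layers, training=False)       # all 12 layers run
sparse = encoder_forward(0, toy_layers, p_drop=0.5,
                         rng=random.Random(7))               # roughly half run
```

Training with many random subnetworks in this way is what lets a single model later tolerate different sparsity levels at inference time.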
The third contribution is a novel input-conditioned layer-dropping approach, applied to speech foundation models such as WavLM and the Audio Spectrogram Transformer, to instill architectural adaptability in them. This approach enables the deployment of speech foundation models on resource-constrained edge devices by reducing their computational load with minimal performance degradation. Lastly, the thesis employs knowledge distillation to further improve the performance-computation trade-off by minimizing the distance between embeddings. The proposed dynamic compression methods achieve performance comparable to the full model at 50% of its computational complexity. Moreover, the compression limit can be extended to 70% with acceptable performance, enabling the model to operate in varying and constrained resource environments.
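The two remaining ideas, input-conditioned layer dropping and an embedding-distance distillation objective, can also be sketched. Everything here is an assumption for illustration: `gated_forward`, the `gate` scoring function, and the `budget` parameter are hypothetical names, and the thesis may use a different gating mechanism and distance measure; the sketch only shows a gate scoring each layer for the current input and keeping the top-scoring layers within a compute budget.

```python
def gated_forward(x, layers, gate, budget):
    """Run only the layers whose input-conditioned gate score ranks in
    the top `budget` (a sketch of input-conditioned layer dropping).
    `gate(x, i)` maps the current input and a layer index to a score.
    """
    scores = [gate(x, i) for i in range(len(layers))]
    keep = set(sorted(range(len(layers)), key=lambda i: -scores[i])[:budget])
    for i, layer in enumerate(layers):
        if i in keep:
            x = layer(x)  # layer executes only if selected for this input
    return x

def embedding_distance(student, teacher):
    """Mean squared distance between student and teacher embedding
    vectors, one plausible form of the distillation objective; the
    thesis may use a different distance.
    """
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

# Toy usage: 5 "layers" that each add 1; a gate preferring early layers
# and a budget of 3 means only the first 3 layers execute.
toy_layers = [(lambda v: v + 1) for _ in range(5)]
reduced = gated_forward(0, toy_layers, gate=lambda x, i: -i, budget=3)
```

Minimizing such a distance between the reduced model's embeddings and the full model's embeddings is what recovers part of the accuracy lost to dropping layers.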
| File | Size | Format | |
|---|---|---|---|
| Phd Thesis - Abdul Hannan.pdf (open access; license: all rights reserved) | 4.53 MB | Adobe PDF | View/Open |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/359613
URN:NBN:IT:UNITN-359613