Towards Self-configuring Efficient Neural Architectures for Automatic Speech Recognition
Hannan, Abdul
2026
Abstract
Recent advances in neural architecture design have led to complex deep neural networks that exhibit promising performance on downstream speech applications such as automatic speech recognition (ASR). The superior performance of these models comes at the cost of immense training data and training time, along with significant computational resources. Effectively leveraging the capabilities of such models on varying and resource-constrained edge devices poses considerable challenges. It is therefore desirable to optimize these models, with minimal performance degradation, to render them operational on resource-constrained devices. In addition, since on-device resources are shared with other device functions, the model should be able to adapt its architecture for seamless operation within the resources available on the device. This thesis proposes novel dynamic compression methods that address these limitations by allowing a model to adapt its architecture on the fly. The first contribution is a semi-dynamic training framework in which a lightweight Conformer learns encoder-level representations from a large Conformer model via knowledge distillation. In a second step, a decoder is appended on top of the distilled model and fine-tuning is performed, yielding lightweight models with a 3x speed-up over existing approaches. The second contribution is a thorough exploration of inference-time layer-dropping methods, quantifying the performance of a Conformer model trained with random layer dropping at different sparsity levels. This work lays the foundation for achieving an optimal performance-computation trade-off when encoder layers are dropped at random, and serves as a baseline for subsequent work.
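The random layer dropping described above can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `encoder_forward`, the toy layers, and the `p_drop` parameter are illustrative assumptions, showing only the core idea that each encoder layer is skipped independently with some probability during training, while inference can run the full stack or a reduced one.

```python
import random

def encoder_forward(x, layers, p_drop=0.5, training=True, rng=None):
    """Pass x through a stack of encoder layers, skipping each layer
    independently with probability p_drop during training (a sketch of
    random layer dropping). At inference (training=False) every layer
    runs; a fixed subset could instead be kept to meet a compute budget.
    """
    rng = rng or random.Random(0)
    for layer in layers:
        if training and rng.random() < p_drop:
            continue  # skip this layer entirely (identity connection)
        x = layer(x)
    return x

# Toy usage: 12 "layers" that each add 1 to the input.
toy_layers = [(lambda v: v + 1) for _ in range(12)]
full = encoder_forward(0, toy_layers, training=False)       # all 12 layers run
sparse = encoder_forward(0, toy_layers, p_drop=0.5,
                         rng=random.Random(7))               # roughly half run
```

Training with many random subnetworks in this way is what lets a single model later tolerate different sparsity levels at inference time.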
The third contribution is a novel input-conditioned layer-dropping approach, applied to speech foundation models such as WavLM and the Audio Spectrogram Transformer, to instill architectural adaptability in them. This approach enables the deployment of speech foundation models on resource-constrained edge devices by reducing their computational load with minimal performance degradation. Lastly, the thesis employs knowledge distillation to further improve the performance-computation trade-off by minimizing the distance between embeddings. The proposed dynamic compression methods achieve performance comparable to the full model at 50% of its computational complexity. Moreover, the compression limit can be extended to 70% with acceptable performance, enabling the model to operate in varying and constrained resource environments.
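The two remaining ideas, input-conditioned layer dropping and an embedding-distance distillation objective, can also be sketched. Everything here is an assumption for illustration: `gated_forward`, the `gate` scoring function, and the `budget` parameter are hypothetical names, and the thesis may use a different gating mechanism and distance measure; the sketch only shows a gate scoring each layer for the current input and keeping the top-scoring layers within a compute budget.

```python
def gated_forward(x, layers, gate, budget):
    """Run only the layers whose input-conditioned gate score ranks in
    the top `budget` (a sketch of input-conditioned layer dropping).
    `gate(x, i)` maps the current input and a layer index to a score.
    """
    scores = [gate(x, i) for i in range(len(layers))]
    keep = set(sorted(range(len(layers)), key=lambda i: -scores[i])[:budget])
    for i, layer in enumerate(layers):
        if i in keep:
            x = layer(x)  # layer executes only if selected for this input
    return x

def embedding_distance(student, teacher):
    """Mean squared distance between student and teacher embedding
    vectors, one plausible form of the distillation objective; the
    thesis may use a different distance.
    """
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

# Toy usage: 5 "layers" that each add 1; a gate preferring early layers
# and a budget of 3 means only the first 3 layers execute.
toy_layers = [(lambda v: v + 1) for _ in range(5)]
reduced = gated_forward(0, toy_layers, gate=lambda x, i: -i, budget=3)
```

Minimizing such a distance between the reduced model's embeddings and the full model's embeddings is what recovers part of the accuracy lost to dropping layers.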
| File | Size | Format | |
|---|---|---|---|
| Phd Thesis - Abdul Hannan.pdf (open access; license: all rights reserved) | 4.53 MB | Adobe PDF | View/Open |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/359613
URN:NBN:IT:UNITN-359613