Towards Trustworthy Speech Technologies: Enhancing Explainability and Mitigating Gender Bias in Speech Transcription and Translation
Fucci, Dennis
2025
Abstract
Language technologies are increasingly used in a variety of applications, including speech-related tasks such as Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST). However, this progress raises concerns about trustworthiness. For example, ASR systems often transcribe male and female voices with different levels of accuracy, while AST systems often reflect stereotypes, e.g., by translating "doctor" as masculine and "nurse" as feminine, or by defaulting to masculine forms even for female referents. These issues, known as gender bias, affect both recognition and translation, particularly of speaker-referred expressions (e.g., "I am a student"), and are linked to the variability of human voices as influenced by gender.

This PhD thesis addresses gender bias in ASR and AST from two main perspectives: explainability and mitigation. On the explainability side, the goal is to understand the sources of gender bias and generate insights that guide mitigation strategies. A significant portion of the thesis also contributes to the broader development of explainability for speech-to-text (S2T) applications, particularly ASR and AST systems. It introduces the first feature attribution technique for modern autoregressive S2T models, providing fine-grained explanations of ASR and AST outputs by assessing the relevance of audio features and previously generated tokens.

To investigate the factors contributing to gender bias, often linked to data imbalance and variability, the thesis examines the role of acoustic and lexical features. Results indicate that pitch, speaking rate, intensity, and lexical complexity do not significantly account for gender disparities. However, feature attribution analyses reveal that vowel formants, which are crucial for speech recognition, differ in relevance across male and female speech and contribute to gender assignment in the translation of speaker-referred expressions. Additionally, probing the hidden representations of different model architectures reveals that the encoding of gender varies across models, and that stronger encoding is associated with reduced bias, especially in translation.

On the mitigation side, previous approaches typically rely on training or fine-tuning with gender-balanced or gender-annotated speech data, which is costly and often unavailable across languages. This thesis proposes two alternatives: a data augmentation technique based on pitch manipulation, which improves ASR performance for underrepresented groups without requiring additional speech data, and a mitigation method for AST that leverages gender-specialized external language models to adjust the translation of speaker-referred words. The latter is applicable to existing systems without retraining.

Overall, this work advances explainability in speech models, deepens our understanding of gender bias, and introduces practical mitigation techniques, contributing to more trustworthy speech technologies.
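To give a concrete picture of perturbation-based feature attribution as described above, the sketch below masks patches of an input spectrogram and measures the drop in the model's score for the generated tokens. This is a generic occlusion illustration, not the thesis's actual technique; `ToyModel` and its `score` interface are hypothetical stand-ins so the example runs on its own.

```python
import numpy as np

class ToyModel:
    """Hypothetical stand-in scorer, NOT a real S2T model; exists only to make the sketch runnable."""
    def score(self, spec, tokens):
        # Pretend log-probability: higher when low-frequency energy is preserved
        return float(spec[:, :8].sum())

def occlusion_relevance(spec, tokens, model, t_win=10, f_win=8):
    """Relevance of each spectrogram patch = drop in the output score when that patch is masked."""
    base = model.score(spec, tokens)
    relevance = np.zeros_like(spec)
    for t in range(0, spec.shape[0], t_win):
        for f in range(0, spec.shape[1], f_win):
            masked = spec.copy()
            masked[t:t + t_win, f:f + f_win] = 0.0  # occlude one time-frequency patch
            relevance[t:t + t_win, f:f + f_win] = base - model.score(masked, tokens)
    return relevance

spec = np.random.rand(100, 80)  # 100 frames x 80 mel bins, stand-in input
rel = occlusion_relevance(spec, tokens=["_a"], model=ToyModel())
print(rel.shape)  # (100, 80): one relevance value per spectrogram cell
```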
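The probing analyses mentioned above can be pictured as training a lightweight classifier to predict speaker gender from a model's hidden representations; higher probe accuracy is read as stronger gender encoding. A minimal scikit-learn sketch follows, using random vectors as stand-ins for mean-pooled encoder states (the real analysis would extract these from the S2T models under study).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 512))   # stand-in for pooled encoder hidden states
y = rng.integers(0, 2, size=1000)  # speaker gender labels (0/1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# On random features the probe should hover near chance (~0.5); on real
# representations, accuracy above chance indicates encoded gender information.
print(f"probe accuracy: {probe.score(X_te, y_te):.3f}")
```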
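For the pitch-based augmentation, the general recipe is to create pitch-shifted copies of existing utterances so that underrepresented voice ranges are better covered in training, without collecting new speech. Below is a hedged sketch using librosa; the 9-semitone shift, the synthetic tone, and the file name are illustrative assumptions, not the thesis's actual settings.

```python
import numpy as np
import librosa
import soundfile as sf

def pitch_shift_copy(y, sr, n_steps):
    """Return a pitch-shifted copy of an utterance (n_steps in semitones)."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

# Synthetic 1-second tone as a stand-in for a real utterance
sr = 16000
y = 0.1 * np.sin(2 * np.pi * 120 * np.linspace(0, 1, sr))  # ~120 Hz, a typical male F0

# Shift up 9 semitones: ~120 Hz -> ~200 Hz, roughly a typical female F0 range
y_up = pitch_shift_copy(y, sr, n_steps=9.0)
sf.write("augmented_utt.wav", y_up, sr)
```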
| File | Size | Format | |
|---|---|---|---|
| phd_thesis.pdf (open access; License: All rights reserved) | 18.94 MB | Adobe PDF | View/Open |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/307935
URN:NBN:IT:UNITN-307935