Fairness Auditing, Explanation and Debiasing in Linguistic Data and Language Models
MARCHIORI MANERBA, MARTA
2025
Abstract
With the widespread adoption of Language Models (LMs) in high-stakes decision-making contexts, mitigating discrimination and improving explainability have emerged as critical challenges. Biases are not only encoded in the internal representations of these models but also perpetuated in downstream tasks, often resulting in disparate outcomes across demographic groups. Concurrently, the opacity of their inner mechanisms raises trust concerns, since their behavior is difficult to account for and explain. This thesis addresses these issues by developing methodologies for fairness auditing, explainability, and debiasing of linguistic data and LMs. Contributions are organized around three core research questions, each corresponding to a distinct part of the work: i) how data mitigation pipelines can be implemented to reduce biases in training data; ii) what fairness- and explainability-based protocols can be developed to audit and detect biases in task-specific LMs, such as NLP classifiers; and iii) how bias auditing can be performed on pre-trained LMs, i.e., before they are fine-tuned for specific applications. The first part of the thesis introduces novel data mitigation frameworks, including sampling strategies, data generation, and augmentation, to improve the balance across demographic groups. We evaluate the effects of these strategies on both classification outcomes and explanations. Experiments demonstrate that these interventions often improve fairness with minimal performance loss, and that explicitly incorporating sensitive attributes into the classification model is more effective than masking them. The second part shifts the focus to fairness auditing pipelines for NLP classifiers, particularly for the task of detecting abusive language. By leveraging local explainers and template-based evaluations, we investigate the potential of explainability as a tool for discrimination discovery. The results show that even high-performing classifiers struggle with implicit bias. The third and final part extends fairness auditing to general-purpose LMs, introducing new probing methods, benchmarking tools, and resources for detecting representational harms. The evaluations show that LMs encode harmful stereotypes and that model scale, architecture, and training influence the degree of bias. This dissertation argues that fairness and explainability must be central to the development of language technologies, ensuring their accountability and making them safer and more worthy of public trust. Ultimately, by emphasizing the importance of context-sensitive auditing, this research contributes to promoting NLP practices that are more inclusive, transparent, and accountable.
File | Access | Size | Format
---|---|---|---
PhD_Thesis___Marta_Marchiori_Manerba.pdf | open access | 20.23 MB | Adobe PDF
Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.
https://hdl.handle.net/20.500.14242/219615
URN:NBN:IT:UNIPI-219615