Fairness Auditing, Explanation and Debiasing in Linguistic Data and Language Models

Marchiori Manerba, Marta
2025

Abstract

With the widespread adoption of Language Models (LMs) in high-stakes decision-making contexts, mitigating discrimination and improving explainability have emerged as critical challenges. Biases are not only encoded in the internal representations of these models but also perpetuated in downstream tasks, often resulting in disparate outcomes across demographic groups. Concurrently, the opacity of their inner mechanisms raises trust concerns, since their behavior is difficult to account for and explain. This thesis addresses these issues by developing methodologies for fairness auditing, explainability, and debiasing of linguistic data and LMs. Contributions are organized around three core research questions, each corresponding to a distinct part of the work: i) how can data mitigation pipelines be implemented to reduce biases in training data; ii) what fairness- and explainability-based protocols can be developed to audit and detect biases in task-specific LMs, such as NLP classifiers; iii) how can bias auditing be performed on pre-trained LMs, i.e., before they are fine-tuned for specific applications. The first part of the thesis introduces novel data mitigation frameworks, including sampling strategies, data generation, and augmentation, to improve balance across demographic groups. We evaluate the effects of these strategies on both classification outcomes and explanations. Experiments demonstrate that these interventions often improve fairness with minimal performance loss, and that explicitly incorporating sensitive attributes into the classification model is more effective than masking them. The second part shifts the focus to fairness auditing pipelines for NLP classifiers, particularly for the task of detecting abusive language. By leveraging local explainers and template-based evaluations, we investigate the potential of explainability as a tool for discrimination discovery. The results emphasize that even high-performing classifiers struggle with implicit bias. The third and final part extends fairness auditing to general-purpose LMs, introducing new probing methods, benchmarking tools, and resources for detecting representational harms. The evaluations show that LMs encode harmful stereotypes and that model scale, architecture, and training influence the degree of bias. This dissertation argues that fairness and explainability must be central to the development of language technologies, ensuring their accountability and making them safer and more worthy of public trust. Ultimately, by emphasizing the importance of context-sensitive auditing, this research contributes to more inclusive, transparent, and accountable NLP practices.
13-Jul-2025
English
ML
NLP
Explainability
Interpretability
ML Evaluation
Fairness in ML
Algorithmic Bias
Bias Mitigation
Algorithmic Auditing
Data Awareness
Data Equity
Discrimination
Stereotypes
Intersectionality
Guidotti, Riccardo
Ruggieri, Salvatore
Files in this record:

File: PhD_Thesis___Marta_Marchiori_Manerba.pdf
Access: open access
Size: 20.23 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/219615
The NBN code of this thesis is URN:NBN:IT:UNIPI-219615