Machines Against Rage: Generating High-quality Counterspeech with Language Models
Bonaldi, Helena
2025
Abstract
Online hate speech is typically tackled through blocking or deletion measures. However, these measures have limited effectiveness and often raise concerns about the protection of users' freedom of speech. In this context, counterspeech has emerged as a promising alternative strategy, as it fights online hate with positive and de-escalatory responses. The potential effectiveness of counterspeech has motivated growing interest in ways to partially automatise its production: the goal of this work is to investigate the extent to which Natural Language Generation can be employed for this task. Specifically, we focus on how counterspeech can be automatically produced by Language Models, currently the most powerful tools available for text generation. We first address how to effectively collect counterspeech data by combining human expertise and machine generation to obtain single-turn and multi-turn counterspeech interactions. Second, we fine-tune various language models on the collected data and compare their performance in generating counterspeech under different decoding mechanisms. This allows us to identify one of the major weaknesses of language models in this task: the tendency to produce vague generations that could technically follow any input but lack specificity in their content. We address this problem in two ways. First, we intervene at training time, proposing two attention-based regularisation techniques to prevent lexical overfitting. Then, we test whether factors outside training also affect generation quality: in particular, we investigate whether safety guardrails weaken a model's argumentative strength, and we compare the cogency of different argumentative strategies for refuting hate. We conclude by discussing open challenges of counterspeech research in NLP.
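As a rough illustration of two of the techniques named in the abstract, the sketch below compares decoding mechanisms on a single input and adds a generic attention-entropy regulariser to a fine-tuning loss, using the Hugging Face transformers API. The model name, prompt format, hyperparameters, and the entropy penalty itself are illustrative assumptions; the thesis's actual models, data, and its two specific regularisation techniques are not reproduced here.

```python
# Minimal sketch, NOT the thesis's actual setup: model, prompt, and
# hyperparameters are placeholders chosen for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a counterspeech fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Hate speech: <example input>\nCounterspeech:"
inputs = tokenizer(prompt, return_tensors="pt")

# (1) Compare common decoding mechanisms on the same input.
decoding_configs = {
    "greedy": dict(do_sample=False),
    "beam": dict(do_sample=False, num_beams=5),
    "nucleus": dict(do_sample=True, top_p=0.9, temperature=0.7),
}
for name, cfg in decoding_configs.items():
    out = model.generate(**inputs, max_new_tokens=60,
                         pad_token_id=tokenizer.eos_token_id, **cfg)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    print(f"[{name}] {tokenizer.decode(new_tokens, skip_special_tokens=True)}")

# (2) A generic attention-entropy bonus: one simple form of attention-based
# regularisation that discourages heads from locking onto a few lexical cues.
# The thesis proposes its own two techniques; this is only a stand-in.
def attention_entropy(attentions):
    ent = torch.tensor(0.0)
    for layer in attentions:                  # each: (batch, heads, q, k)
        p = layer.clamp_min(1e-9)
        ent = ent + (-(p * p.log()).sum(dim=-1)).mean()
    return ent / len(attentions)

out = model(**inputs, labels=inputs["input_ids"], output_attentions=True)
loss = out.loss - 0.01 * attention_entropy(out.attentions)  # reward entropy
loss.backward()  # one step inside a fine-tuning loop
```

In a real fine-tuning loop the entropy weight (0.01 here) would be tuned, and the penalty would be computed over counterspeech training batches rather than a single prompt.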
File | Size | Format | Access
---|---|---|---
phd_thesis_final.pdf | 2.18 MB | Adobe PDF | Open access
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/217984
URN:NBN:IT:UNITN-217984