Machines Against Rage: Generating High-quality Counterspeech with Language Models
Bonaldi, Helena
2025
Abstract
Online hate speech is typically tackled through blocking or deletion measures. However, these measures have limited effectiveness and often raise concerns about the protection of users' freedom of speech. In this context, counterspeech has emerged as a promising alternative strategy, as it fights online hate with positive and de-escalatory responses. The potential effectiveness of counterspeech has motivated growing interest in ways to partially automatise its production: the goal of this work is to investigate the extent to which Natural Language Generation can be employed for this task. Specifically, we focus on how counterspeech can be automatically produced by Language Models, currently the most powerful tools available for text generation. We first address how to effectively collect counterspeech data by combining human expertise and machine generation to obtain single-turn and multi-turn counterspeech interactions. Second, we fine-tune various language models on the collected data and compare their performance in generating counterspeech under different decoding mechanisms. This allows us to identify one of the major weaknesses of language models in this task: the tendency to produce vague generations that could technically follow any input but lack specificity in their content. We address this problem in two ways. First, we intervene at training time, proposing two attention-based regularisation techniques to prevent lexical overfitting. Then, we test whether factors outside training also affect generation quality: in particular, we investigate whether safety guardrails weaken a model's argumentative strength, and we compare the cogency of different argumentative strategies for refuting hate. We conclude by discussing open challenges of counterspeech research in NLP.
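As a rough illustration of two of the techniques named in the abstract, the sketch below compares decoding mechanisms on a single input and adds a generic attention-entropy regulariser to a fine-tuning loss, using the Hugging Face transformers API. The model name, prompt format, hyperparameters, and the entropy penalty itself are illustrative assumptions; the thesis's actual models, data, and its two specific regularisation techniques are not reproduced here.

```python
# Minimal sketch, NOT the thesis's actual setup: model, prompt, and
# hyperparameters are placeholders chosen for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a counterspeech fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Hate speech: <example input>\nCounterspeech:"
inputs = tokenizer(prompt, return_tensors="pt")

# (1) Compare common decoding mechanisms on the same input.
decoding_configs = {
    "greedy": dict(do_sample=False),
    "beam": dict(do_sample=False, num_beams=5),
    "nucleus": dict(do_sample=True, top_p=0.9, temperature=0.7),
}
for name, cfg in decoding_configs.items():
    out = model.generate(**inputs, max_new_tokens=60,
                         pad_token_id=tokenizer.eos_token_id, **cfg)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    print(f"[{name}] {tokenizer.decode(new_tokens, skip_special_tokens=True)}")

# (2) A generic attention-entropy bonus: one simple form of attention-based
# regularisation that discourages heads from locking onto a few lexical cues.
# The thesis proposes its own two techniques; this is only a stand-in.
def attention_entropy(attentions):
    ent = torch.tensor(0.0)
    for layer in attentions:                  # each: (batch, heads, q, k)
        p = layer.clamp_min(1e-9)
        ent = ent + (-(p * p.log()).sum(dim=-1)).mean()
    return ent / len(attentions)

out = model(**inputs, labels=inputs["input_ids"], output_attentions=True)
loss = out.loss - 0.01 * attention_entropy(out.attentions)  # reward entropy
loss.backward()  # one step inside a fine-tuning loop
```

In a real fine-tuning loop the entropy weight (0.01 here) would be tuned, and the penalty would be computed over counterspeech training batches rather than a single prompt.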
File | Size | Format | Access
---|---|---|---
phd_thesis_final.pdf | 2.18 MB | Adobe PDF | Open access
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/217984
URN:NBN:IT:UNITN-217984