Overcoming sense inventories weaknesses through coarse-grained resources and lexical substitution

Lacerra, Caterina

This dissertation addresses the granularity issue in sense inventories by proposing two alternative approaches to word sense disambiguation (WSD). First, a coarse-grained framework is introduced through the Coarse Sense Inventory (CSI), consisting of 45 categories optimized for interpretability and human annotator agreement. CSI demonstrates a competitive balance between performance and expressiveness, outperforming alternative inventories, particularly in few-shot learning scenarios. Second, the thesis explores lexical substitution as a viable alternative to traditional WSD, facilitated by the introduction of two novel resources: ALaSca and GeneSis. ALaSca, the first large-scale dataset for lexical substitution, leverages clustering to account for context-specific word meanings, enabling fine-tuned language models to surpass unsupervised baselines in candidate ranking tasks. GeneSis, a generative seq2seq approach, further advances lexical substitution by producing contextually appropriate substitutes and achieving state-of-the-art performance on substitute prediction tasks. Despite these advances, challenges remain, including limited evaluation settings and a focus on English, which open avenues for future research in multilingual lexical substitution, lexical simplification, and contextual word understanding.

2022

Degree programme: Computer Science (Informatica)
Publication date: 20 May 2022
Language: English
Supervisor: NAVIGLI, Roberto
Publisher: Università degli Studi di Roma "La Sapienza"
Collection: Università degli Studi di Roma La Sapienza

Files in this item:

File	Size	Format
Tesi_dottorato_Lacerra.pdf (open access; license: all rights reserved)	4.88 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/187887

The NBN code of this thesis is URN:NBN:IT:UNIROMA1-187887