Overcoming sense inventories weaknesses through coarse-grained resources and lexical substitution

LACERRA, CATERINA
2022

Abstract

This dissertation addresses the granularity issue in sense inventories by proposing two alternative approaches to word sense disambiguation (WSD). First, a coarse-grained framework is introduced through the Coarse Sense Inventory (CSI), consisting of 45 categories optimized for interpretability and human annotator agreement. CSI demonstrates a competitive balance between performance and expressiveness, outperforming alternative inventories, particularly in few-shot learning scenarios. Second, the thesis explores lexical substitution as a viable alternative to traditional WSD, facilitated by the introduction of two novel resources: ALaSca and GeneSis. ALaSca, the first large-scale dataset for lexical substitution, leverages clustering to account for context-specific word meanings, enabling fine-tuned language models to surpass unsupervised baselines in candidate ranking tasks. GeneSis, a generative seq2seq approach, further advances lexical substitution by producing contextually appropriate substitutes and achieving state-of-the-art performance on substitute prediction tasks. Despite these advances, challenges remain, including limited evaluation settings and a focus on English, which open avenues for future research in multilingual lexical substitution, lexical simplification, and contextual word understanding.
20 May 2022
English
NAVIGLI, Roberto
Università degli Studi di Roma "La Sapienza"
Files in this item:
File: Tesi_dottorato_Lacerra.pdf
Access: open access
Size: 4.88 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/187887
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-187887