Overcoming sense inventories weaknesses through coarse-grained resources and lexical substitution
LACERRA, CATERINA
2022
Abstract
This dissertation addresses the granularity issue in sense inventories by proposing two alternative approaches to word sense disambiguation (WSD). First, a coarse-grained framework is introduced through the Coarse Sense Inventory (CSI), consisting of 45 categories optimized for interpretability and human annotator agreement. CSI demonstrates a competitive balance between performance and expressiveness, outperforming alternative inventories, particularly in few-shot learning scenarios. Second, the thesis explores lexical substitution as a viable alternative to traditional WSD, facilitated by the introduction of two novel resources: ALaSca and GeneSis. ALaSca, the first large-scale dataset for lexical substitution, leverages clustering to account for context-specific word meanings, enabling fine-tuned language models to surpass unsupervised baselines in candidate ranking tasks. GeneSis, a generative seq2seq approach, further advances lexical substitution by producing contextually appropriate substitutes and achieving state-of-the-art performance on substitute prediction tasks. Despite these advances, challenges remain, including limited evaluation settings and a focus on English, which open avenues for future research in multilingual lexical substitution, lexical simplification, and contextual word understanding.

File | Size | Format
---|---|---
Tesi_dottorato_Lacerra.pdf (open access) | 4.88 MB | Adobe PDF
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/187887
URN:NBN:IT:UNIROMA1-187887