Understanding and Exploiting Language Diversity

Batsuren, Khuyagbaatar
2018

Abstract

Languages are well known to be diverse at all structural levels, from the smallest (phonemic) to the broadest (pragmatic). We propose a set of formal, quantitative measures of the language diversity of linguistic phenomena, of resource incompleteness, and of resource incorrectness. We apply these measures to lexical semantics, where we show how evidence of a high degree of universality within a given language set can be used to extend lexico-semantic resources in a precise, diversity-aware manner. We demonstrate our approach on several case studies. The first concerns polysemes and homographs among cases of lexical ambiguity. In contrast to past research, which focused solely on exploiting systematic polysemy, the notion of universality provides an automated method that is also capable of predicting irregular polysemes. The second automatically identifies cognates in existing lexical resources across the different orthographies of genetically unrelated languages. In contrast to past research, which focused on detecting cognates among the 225 concepts of the Swadesh list, we captured 3.1 million cognate pairs across 40 orthographies and 335 languages by exploiting existing wordnet-like lexical resources.
2018
English
Giunchiglia, Fausto
Università degli studi di Trento
TRENTO
95
Files in this record:
disclaimer_batsuren.pdf (Adobe PDF, 1.09 MB). Access only from BNCF and BNCR. License: all rights reserved.
thesis_(47).pdf (Adobe PDF, 6.51 MB). Open access from 02/01/2020. License: all rights reserved.

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/92229
The NBN code of this thesis is URN:NBN:IT:UNITN-92229