Algorithms for Knowledge and Information Extraction in Text with Wikipedia
2019
Abstract
This thesis focuses on the design of algorithms for extracting knowledge (in terms of entities belonging to a knowledge graph) and information (in terms of open facts) from text, using Wikipedia as the main repository of world knowledge. The first part of the dissertation addresses research problems that lie specifically in the domain of knowledge and information extraction. In this context, we contribute three results to the scientific literature. First, we study the problem of computing the relatedness between Wikipedia entities: we introduce a new dataset of human judgements, complement it with a study of all entity relatedness measures proposed in recent literature, and propose a new, computationally lightweight two-stage framework for relatedness computation. Second, we study the problem of entity salience through the design and implementation of a new system that identifies the salient Wikipedia entities occurring in an input text and that improves on the state of the art across different datasets. Third, we introduce a new research problem called fact salience, which addresses the task of detecting the salient open facts extracted from an input text, and we propose, design and implement the first system that effectively solves it. In the second part of the dissertation, we study an application of knowledge extraction tools in the domain of expert finding. We propose a new system that hinges upon a novel profiling technique, which models people (i.e., experts) through a small, labeled graph drawn from Wikipedia. This profiling technique is then used to design a novel suite of ranking algorithms for matching the user query, whose effectiveness is shown by improvements over state-of-the-art solutions.

File | Access | Type | Size | Format |
---|---|---|---|---|
dissertation.pdf | Open access | Other attached material | 6.83 MB | Adobe PDF |
report.pdf | Open access | Other attached material | 29.44 kB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/134210
URN:NBN:IT:UNIPI-134210