Ontology-based information extraction experiences, framework, algorithms and tools

Scafoglieri, Federico
2021

Abstract

A significant portion of the information collected by enterprises and organizations resides in text documents and is thus inherently unstructured. Turning it into a structured form is the aim of Information Extraction (IE). Depending on the approach, the output of an IE process can fill forms, populate relational tables, or even be presented through an ontology. This last approach, known in the literature as Ontology Based Information Extraction (OBIE), is particularly interesting, since ontologies may facilitate integration with other corporate and external data and enable data management and governance at an abstract, conceptual level. However, although OBIE has been the subject of several investigations, how to exploit the reasoning abilities offered by an ontology to improve the extraction process has not yet been specifically studied. This thesis is intended to be a first step in that direction. Starting from the experience we gained implementing OBIE systems with open-source technologies, and with the intent of addressing the weaknesses we encountered, we propose a formal framework for OBIE, called Ontology Based Document Spanning (OBDS). We devise our proposal by revisiting the Ontology Based Data Access (OBDA) paradigm, a sophisticated form of semantic data integration from relational databases, and by leveraging the investigation on Document Spanners, a recent formal study of rule-based information extraction that follows database principles. The reasoning service of main interest in OBDS, as usual in ontology based data management approaches, is Query Answering (QA). We provide an analysis of this service in different settings and propose algorithms for QA, in the spirit of OBDA. Here we show how the ontology plays a major role by mediating the extraction of information from text.
To demonstrate the applicability of our approach in practice, we illustrate Mastro System-T, an OBDS tool that we have implemented using robust industrial technologies and tested on large document datasets. Last but not least, we formally treat the problem of Entity Resolution (ER), which is recurrent in the OBIE context, as in information integration approaches in general.
17 Sep 2021
English
LEMBO, Domenico
DEMETRESCU, Camil
Università degli Studi di Roma "La Sapienza"
207
Files in this item:
File Size Format
Tesi_dottorato_Scafoglieri.pdf (open access) 4.76 MB Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/182543
The NBN code of this thesis is URN:NBN:IT:UNIROMA1-182543