Ontology-based information extraction: experiences, framework, algorithms and tools

Scafoglieri, Federico

A significant portion of the information collected by enterprises and organizations resides in text documents and is thus inherently unstructured. Turning it into a structured form is the aim of Information Extraction (IE). Depending on the approach, the output of an IE process can fill forms, populate relational tables, or even be presented through an ontology. This last approach, known in the literature as Ontology-Based Information Extraction (OBIE), is particularly interesting, since ontologies may facilitate integration with other corporate and external data and enable data management and governance at an abstract, conceptual level. However, although OBIE has been the subject of several investigations, how to exploit the reasoning abilities offered by an ontology to improve the extraction process has not yet been specifically studied. This thesis is intended as a first step in that direction. Starting from the experience we gained implementing OBIE systems with open-source technologies, and with the intent of addressing the weaknesses we encountered, we propose a formal framework for OBIE, called Ontology-Based Document Spanning (OBDS). We devise our proposal by revisiting the Ontology-Based Data Access (OBDA) paradigm, a sophisticated form of semantic data integration over relational databases, and by leveraging the work on Document Spanners, a recent formal study of rule-based information extraction that follows database principles. The reasoning service of main interest in OBDS, as usual in ontology-based data management approaches, is Query Answering (QA). We provide an analysis of this service in different settings and propose algorithms for QA, in the spirit of OBDA. In doing so, we show how the ontology plays a major role by mediating the extraction of information from text.
To demonstrate the applicability of our approach in practice, we illustrate Mastro System-T, an OBDS tool that we have implemented using robust industrial technologies and tested on large document datasets. Finally, we formally treat the problem of Entity Resolution (ER), which recurs in the OBIE context, as it does in information integration approaches in general.

2021

Faculty/Department	DIPARTIMENTO DI INGEGNERIA INFORMATICA, AUTOMATICA E GESTIONALE -ANTONIO RUBERTI-
Degree programme	Ingegneria informatica (Computer Engineering)
Publication date	17 Sep 2021
Language	English
Supervisor, Advisor or Tutor	LEMBO, Domenico
Co-Supervisor, Co-Tutor or Coordinator	DEMETRESCU, Camil
Publisher	Università degli Studi di Roma "La Sapienza"
Number of pages	207
Collection	Università degli Studi di Roma La Sapienza

Files in this item:

File	Access	Licence	Size	Format
Tesi_dottorato_Scafoglieri.pdf	Open access	All rights reserved	4.76 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/182543

The NBN code of this thesis is URN:NBN:IT:UNIROMA1-182543