A SEMANTIC APPROACH FOR CONSTRUCTING KNOWLEDGE GRAPHS EXTRACTED FROM TABLES

BONFITTO, SARA
2023

Abstract

Knowledge graphs (KGs) are networks of real-world entities together with their relationships and properties. They are increasingly used to integrate heterogeneous information sources into a common model that facilitates interoperability among applications and yields a large amount of information that can be exploited for machine learning predictions. Many approaches have been proposed for constructing KGs from tables extracted from spreadsheets, Web tables, or tables contained in digital documents. These approaches entail locating and segmenting the table in the source document, extracting its components, identifying the function of its different areas, and discriminating the relationships between attributes. However, the semantic characterization of the table content in terms of a domain ontology and the generation of the KG remain open research problems because of the heterogeneity of table contents, the possible presence of mistakes, and the lack of standardization. The goal of this thesis is to develop an approach that supports the user in constructing a knowledge graph compliant with a domain ontology, starting from tabular data that present a complex structure and contain syntactic and semantic mistakes. We believe that a completely automatic approach based on sophisticated machine learning (ML) techniques cannot be properly used in this context. Instead, a semi-automatic approach can be devised for data cleaning, semantic characterization of the table content, and translation into the KG representation. Users need to be supported by easy-to-use graphical interfaces for correcting mistakes and improving the system’s overall performance.
For these reasons, we have devised a three-phase approach. The first phase focuses on the semantic characterization of table columns in terms of the basic types and/or properties of a domain ontology. In this phase, a table is extracted from a spreadsheet and several cleaning activities are carried out, such as removing headers and footers, removing blank and semi-blank rows, and detecting correlated table rows using a declarative pattern-based language. Then, by identifying the basic types of the table columns, syntactic mistakes are detected and the user can correct them through dedicated interfaces. This process improves the characterization of the semantic concepts contained in the table. The second phase focuses on the definition of a semantic description of the table content with respect to a domain ontology. The description is built from the result of the previous phase and exploits a graph neural network model to identify the relations that bind the concepts contained in the table. The third phase focuses on generating the triples of the KG from the table content and the semantic description. For this activity, we have developed interfaces for identifying semantic mistakes in the data and for specifying the identifiers of the KG instances. Several experiments have been conducted to validate the three phases of the proposed approach and the usability of the entire system.
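As a rough illustration of the kind of processing the abstract describes, the following Python sketch drops blank and semi-blank rows, guesses a basic type for a column while flagging cells that disagree with it (candidate syntactic mistakes), and turns a cleaned row into subject-predicate-object triples. It is a minimal sketch under stated assumptions and not the system developed in the thesis: every function name, the regular-expression type detectors, the `min_filled` threshold, and the example IRIs are hypothetical, and the declarative pattern language and the graph neural network mentioned above are not reproduced.

```python
import re
from collections import Counter

# Hypothetical basic-type detectors (assumption: the thesis may use a different set).
_PATTERNS = [
    ("integer", re.compile(r"[+-]?\d+")),
    ("decimal", re.compile(r"[+-]?\d+\.\d+")),
    ("date", re.compile(r"\d{4}-\d{2}-\d{2}")),
]


def drop_sparse_rows(rows, min_filled=2):
    """Keep only rows with at least `min_filled` non-empty cells.

    Blank rows have no filled cells; semi-blank rows have fewer than
    `min_filled`. The threshold is an illustrative assumption.
    """
    def filled(row):
        return sum(1 for cell in row if cell is not None and str(cell).strip())
    return [row for row in rows if filled(row) >= min_filled]


def cell_type(value):
    """Return the name of the first pattern that matches the whole cell."""
    text = value.strip()
    for name, pattern in _PATTERNS:
        if pattern.fullmatch(text):
            return name
    return "string"


def column_type(values):
    """Return the majority basic type and the indexes of disagreeing cells."""
    types = [cell_type(v) for v in values]
    majority = Counter(types).most_common(1)[0][0]
    suspects = [i for i, t in enumerate(types) if t != majority]
    return majority, suspects


def row_to_triples(row, column_to_property, id_column, base_iri):
    """Build (subject, predicate, object) triples for one table row.

    `column_to_property` maps a column index to an ontology property IRI;
    the subject IRI is derived from the identifier column.
    """
    subject = f"{base_iri}{row[id_column]}"
    return [
        (subject, prop, row[col])
        for col, prop in column_to_property.items()
        if col != id_column
    ]


rows = [
    ["1", "Ada", "36"],
    ["", "", ""],          # blank row
    ["2", "Alan", "4l"],   # "4l" is a typo for a number
    [None, "note", None],  # semi-blank row
    ["3", "Grace", "45"],
]
clean = drop_sparse_rows(rows)
age_type, age_suspects = column_type([r[2] for r in clean])
triples = row_to_triples(
    clean[0],
    {1: "http://xmlns.com/foaf/0.1/name", 2: "http://xmlns.com/foaf/0.1/age"},
    id_column=0,
    base_iri="http://example.org/person/",
)
```

In this sketch the flagged cell indexes stand in for the point at which a user would be asked to correct a value before triples are generated, which mirrors the semi-automatic workflow described in the abstract only in spirit.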
27 Apr 2023
English
knowledge graphs; semantic labelling; ontology
MESITI, MARCO
SASSI, ROBERTO
Università degli Studi di Milano
Files in this item:
File: phd_unimi_R12571.pdf (open access)
Size: 6.89 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/82141
The NBN code of this thesis is URN:NBN:IT:UNIMI-82141