Data Search in Practice: How to find Scientific Datasets and link them to the Literature

IRRERA, ORNELLA
2025

Abstract

The rise of Open Science, which aims to enhance accessibility, transparency, and collaboration in scientific research, has led to an exponential increase in the volume of published datasets. In this evolving landscape, data publishing and data citation are essential to ensure the long-term persistence, accessibility, and discoverability of research datasets and their authors. However, the lack of universal agreement on how to publish and cite data has led to the coexistence of a wide range of publication and citation practices, including varied approaches to defining metadata (information about provenance, date, description, and authors), which are often noisy and incomplete and under-represent the related dataset. Scholarly Knowledge Graphs (SKGs), i.e., large, heterogeneous graphs that semantically link research outputs, play a crucial role in organizing and structuring the increasing volume of scholarly knowledge. Yet these graphs are not curated, which compromises their reliability and limits their potential for reuse. Furthermore, poor-quality metadata results in numerous isolated or weakly connected nodes within SKGs. In this context, link prediction and dataset recommendation methods become critical for enhancing graph connectivity, improving the discoverability of datasets, and increasing the visibility and impact of their authors. These tasks help mitigate the challenges posed by incomplete metadata, promoting the reusability of research data and advancing scientific collaboration. In this thesis, we address these challenges through three key research directions.
First, we introduce two strategies for scholarly data curation: a curation pipeline, which we applied to a research community within the OAG to curate and validate nodes and relationships, and an annotation tool for textual documents, which supports the creation of large annotated ground-truth corpora and the curation of scholarly data by identifying new data citations and new relationships between different outcomes. Second, we use this curated graph to analyze prevalent citation patterns in scholarly data, examining how various citation practices affect discoverability and credit attribution. Lastly, we replicate, reproduce, and generalize some of the most recent dataset recommendation methods, discussing their pros and cons, and we leverage the resulting observations to design SAN, a novel GRL method for link prediction and dataset recommendation in real-world scenarios. SAN effectively handles the heterogeneous, sparse, and noisy nature of SKGs and outperforms the state of the art in terms of AUC (for link prediction) and recall and nDCG (for recommendation).
20 March 2025
English
SILVELLO, GIANMARIA
Università degli studi di Padova
Files in this item:
File: tesi_definitiva_ornella_irrera.pdf
Access: open access
Size: 17.11 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/202452
The NBN code of this thesis is URN:NBN:IT:UNIPD-202452