The rise of Open Science, which aims to enhance accessibility, transparency, and collaboration in scientific research, has led to an exponential increase in the volume of published datasets. In this evolving landscape, data publishing and data citation are essential to ensure the long-term persistence, accessibility, and discoverability of research datasets and their authors. However, the lack of universal agreement on how to publish and cite data has led to the coexistence of a wide range of publication and citation practices, including a variety of approaches to defining metadata (information about provenance, date, description, and authors), which are often noisy and incomplete and under-represent the related dataset. SKGs, i.e., large, heterogeneous graphs that semantically link research outputs, play a crucial role in organizing and structuring the increasing volume of scholarly knowledge. Yet these graphs are not curated, which compromises their reliability and limits their potential for reuse. Furthermore, poor-quality metadata results in numerous isolated or weakly connected nodes within the SKGs. In this context, link prediction and dataset recommendation methods become critical for enhancing graph connectivity, improving the discoverability of datasets, and increasing the visibility and impact of their authors. These tasks help mitigate the challenges posed by incomplete metadata, promoting the reusability of research data and advancing scientific collaboration. In this thesis, we address the aforementioned challenges through three key research directions.
First, we introduce two strategies for scholarly data curation. The first is a curation pipeline that we applied to a research community within the OAG to curate and validate nodes and relationships. The second is an annotation tool for textual documents, useful for creating large annotated ground-truth corpora and for curating scholarly data by identifying new data citations and new relationships between different outcomes. Second, we use this curated graph to analyze prevalent citation patterns in scholarly data, examining how various citation practices impact discoverability and credit attribution. Lastly, we replicate, reproduce, and generalize some of the most recent dataset recommendation methods, discussing their pros and cons, and we leverage the resulting observations to design SAN, a novel GRL method for link prediction and dataset recommendation in real-world scenarios. SAN effectively handles the heterogeneous, sparse, and noisy nature of SKGs and outperforms the state of the art in terms of AUC (for link prediction) and recall and nDCG (for recommendation).
Data Search in Practice: How to find Scientific Datasets and link them to the Literature
IRRERA, ORNELLA
2025
File | Size | Format | |
---|---|---|---|
tesi_definitiva_ornella_irrera.pdf (open access) | 17.11 MB | Adobe PDF | View/Open |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/202452
URN:NBN:IT:UNIPD-202452