Data Search in Practice: How to find Scientific Datasets and link them to the Literature

Irrera, Ornella

The rise of Open Science, which aims to enhance accessibility, transparency, and collaboration in scientific research, has led to an exponential increase in the volume of published datasets. In this evolving landscape, data publishing and data citation are essential to ensure the long-term persistence, accessibility, and discoverability of research datasets and their authors. However, the lack of universal agreement on how to publish and cite data has led to the coexistence of a wide range of publication and citation practices, including a variety of approaches to defining metadata (information about provenance, date, description, and authors), which are often noisy and incomplete and under-represent the related dataset. Scholarly Knowledge Graphs (SKGs), i.e., large, heterogeneous graphs that semantically link research outputs, play a crucial role in organizing and structuring the increasing volume of scholarly knowledge. Yet these graphs are not curated, which compromises their reliability and limits their potential for reuse. Furthermore, poor-quality metadata results in numerous isolated or weakly connected nodes within the SKGs. In this context, link prediction and dataset recommendation methods become critical for enhancing graph connectivity, improving the discoverability of datasets, and increasing the visibility and impact of their authors. These tasks help mitigate the challenges posed by incomplete metadata, promoting the reusability of research data and advancing scientific collaboration. In this thesis, we address the aforementioned challenges through three key research directions.
First, we introduce two strategies for scholarly data curation: the first is a curation pipeline that we applied to a research community within the OAG to curate and validate nodes and relationships; the second is an annotation tool for textual documents that supports the creation of large annotated ground-truth corpora and the curation of scholarly data by identifying new data citations and new relationships between different outcomes. Second, we use this curated graph to analyze prevalent citation patterns in scholarly data, examining how various citation practices impact discoverability and credit attribution. Lastly, we replicate, reproduce, and generalize some of the most recent dataset recommendation methods, discussing their pros and cons, and we leverage the resulting observations to design SAN, a novel GRL method for link prediction and dataset recommendation in real-world scenarios that effectively handles the heterogeneous, sparse, and noisy nature of SKGs and outperforms the state of the art in terms of AUC (for link prediction) and recall and nDCG (for recommendation).
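
For readers unfamiliar with the recommendation metrics named above, the following is a minimal sketch of recall and nDCG at a cutoff k. It is not the evaluation code of the thesis: it assumes binary relevance (a dataset is either relevant to a paper or not), and the function names, the cutoff, and the dataset identifiers are hypothetical, chosen only for illustration.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: DCG of the top-k ranking divided by the ideal DCG."""
    # Each relevant item at 0-based position pos contributes 1 / log2(pos + 2).
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items (at most k of them) first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical recommendation list for one paper and its truly linked datasets.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}

print(recall_at_k(ranked, relevant, k=3))
print(ndcg_at_k(ranked, relevant, k=3))
```

In this toy case only one of the two relevant datasets appears in the top three, so recall at 3 is one half, while nDCG additionally penalizes the fact that the hit sits in second position and not first.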

2025


Degree programme: Information Engineering (INGEGNERIA DELL'INFORMAZIONE)
Publication date: 20 March 2025
Language: English
Supervisor, advisor, or tutor: SILVELLO, GIANMARIA
Publisher: Università degli studi di Padova
Collection: Università degli Studi di Padova

Files in this item:

File	Access	License	Size	Format
tesi_definitiva_ornella_irrera.pdf	open access	All rights reserved	17.11 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/202452

The NBN code of this thesis is URN:NBN:IT:UNIPD-202452