Towards a FAIR Open Data Ecosystem using AI and CI

Ahmed, Umair

Open data plays a crucial role in modern digital governance by enabling transparency, fostering innovation, and supporting data-driven decision making. Despite the growing prevalence of open data ecosystems, they remain supplier-driven, exclusive, linear, and best-effort-based. Contemporary governance principles call for their evolution into a more user-driven, inclusive, circular, and skills-based model. Beyond these general principles, the effectiveness of open data ecosystems is also gauged by adherence to the FAIR data principles: Findability, Accessibility, Interoperability, and Reusability. However, many existing open data portals struggle to fully implement these principles. This thesis proposes a multilayered approach to enhancing FAIRness within the open data ecosystem by leveraging the analytical and generative capabilities of Artificial Intelligence (AI), alongside the participatory frameworks enabled by Collective Intelligence (CI). The work is structured around the four core FAIR principles. Under Findability and Accessibility, the thesis first explores automated metadata generation using a suite of fine-tuned LLMs (Flan-T5, Llama, Mistral, Qwen, DeepSeek, GPT) to generate keywords, categories, and descriptions. It also implements a semantic search system built upon knowledge graphs and retrieval-augmented generation to enable natural language access to data. For Interoperability, it explores how LLMs can support ontology alignment using few-shot prompting, and the construction of knowledge graphs through entity extraction from unstructured medical texts.
For Reusability, the thesis draws inspiration from successful open-source crowdsourcing initiatives such as OpenStreetMap (OSM), Wikidata, and GitHub to propose a collective intelligence framework for dataset revisioning and feedback mechanisms suited to open data portals. Finally, this research envisions a hybrid AI and CI infrastructure in which a dataset uploaded by a publisher is automatically enriched with high-quality metadata generated by relevant LLMs. Moreover, it proposes a semantic layer built on top of this metadata generation process, enabling users to query datasets directly in natural language. Alongside the AI layer, the CI layer provides a framework for the community to iteratively improve data and metadata through feedback and revisioning. This architecture and vision advance the open data ecosystem towards being more FAIR: intelligent, participatory, and self-improving.

2025



Faculty/Department: Scuola di Scienze e Tecnologie
Degree programme: Doctoral course in Computer Science and Mathematics
Publication date: 1 December 2025
Language: English
Supervisor, Advisor or Tutor: Andrea Polini
Publisher: Università degli Studi di Camerino
Collection: Università degli Studi di Camerino

Files in this item:

File	Size	Format
thesis ahmed umair def.pdf (embargoed until 1 December 2026; license: all rights reserved)	7.45 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/365034

The NBN code of this thesis is URN:NBN:IT:UNICAM-365034