The Dao of Wikipedia: Extracting Knowledge from the Structure of Wikilinks

Consonni, Cristian

Wikipedia is a multilingual encyclopedia written collaboratively by volunteers online, and it is now the largest, most visited encyclopedia in existence. Wikipedia has arisen through the self-organized collaboration of contributors, and since its launch in January 2001, its potential as a research resource has become apparent to scientists, its appeal lies in the fact that it strikes a middle ground between accurate, manually created, limited-coverage resources, and noisy knowledge mined from the web. For this reason, Wikipedia's content has been exploited for a variety of applications: to build knowledge bases, to study interactions between users on the Internet, and to investigate social and cultural issues such as gender bias in history, or the spreading of information. Similarly to what happened for the Web at large, a structure has emerged from the collaborative creation of Wikipedia: its articles contain hundreds of millions of links. In Wikipedia parlance, these internal links are called wikilinks. These connections explain the topics being covered in articles and provide a way to navigate between different subjects, contextualizing the information, and making additional information available. In this thesis, we argue that the information contained in the link structure of Wikipedia can be harnessed to gain useful insights by extracting it with dedicated algorithms. More prosaically, in this thesis, we explore the link structure of Wikipedia with new methods. In the first part, we discuss in depth the characteristics of Wikipedia, and we describe the process and challenges we have faced to extract the network of links. Since Wikipedia is available in several language editions and its entire edition history is publicly available, we have extracted the wikilink network at various points in time, and we have performed data integration to improve its quality. In the second part, we show that the wikilink network can be effectively used to find the most relevant pages related to an article provided by the user. We introduce a novel algorithm, called CycleRank, that takes advantage of the link structure of Wikipedia considering cycles of links, thus giving weight to both incoming and outgoing connections, to produce a ranking of articles with respect to an article chosen by the user. In the last part, we explore applications of CycleRank. First, we describe the Engineroom EU project, where we faced the challenge to find which were the most relevant Wikipedia pages connected to the Wikipedia article about the Internet. Finally, we present another contribution using Wikipedia article accesses to estimate how the information about diseases propagates. In conclusion, with this thesis, we wanted to show that browsing Wikipedia's wikilinks is not only fascinating and serendipitous, but it is an effective way to extract useful information that is latent in the user-generated encyclopedia.

The Dao of Wikipedia: Extracting Knowledge from the Structure of Wikilinks

Consonni, Cristian

2019

Abstract

Wikipedia is a multilingual encyclopedia written collaboratively by volunteers online, and it is now the largest, most visited encyclopedia in existence. Wikipedia has arisen through the self-organized collaboration of contributors, and since its launch in January 2001, its potential as a research resource has become apparent to scientists, its appeal lies in the fact that it strikes a middle ground between accurate, manually created, limited-coverage resources, and noisy knowledge mined from the web. For this reason, Wikipedia's content has been exploited for a variety of applications: to build knowledge bases, to study interactions between users on the Internet, and to investigate social and cultural issues such as gender bias in history, or the spreading of information. Similarly to what happened for the Web at large, a structure has emerged from the collaborative creation of Wikipedia: its articles contain hundreds of millions of links. In Wikipedia parlance, these internal links are called wikilinks. These connections explain the topics being covered in articles and provide a way to navigate between different subjects, contextualizing the information, and making additional information available. In this thesis, we argue that the information contained in the link structure of Wikipedia can be harnessed to gain useful insights by extracting it with dedicated algorithms. More prosaically, in this thesis, we explore the link structure of Wikipedia with new methods. In the first part, we discuss in depth the characteristics of Wikipedia, and we describe the process and challenges we have faced to extract the network of links. Since Wikipedia is available in several language editions and its entire edition history is publicly available, we have extracted the wikilink network at various points in time, and we have performed data integration to improve its quality. In the second part, we show that the wikilink network can be effectively used to find the most relevant pages related to an article provided by the user. We introduce a novel algorithm, called CycleRank, that takes advantage of the link structure of Wikipedia considering cycles of links, thus giving weight to both incoming and outgoing connections, to produce a ranking of articles with respect to an article chosen by the user. In the last part, we explore applications of CycleRank. First, we describe the Engineroom EU project, where we faced the challenge to find which were the most relevant Wikipedia pages connected to the Wikipedia article about the Internet. Finally, we present another contribution using Wikipedia article accesses to estimate how the information about diseases propagates. In conclusion, with this thesis, we wanted to show that browsing Wikipedia's wikilinks is not only fascinating and serendipitous, but it is an effective way to extract useful information that is latent in the user-generated encyclopedia.

Scheda breve

Scheda completa

Scheda completa (DC)

	Facoltà/Dipartimento
	
				Ingegneria e scienza dell'Informaz (29/10/12-)
			
	Corso di studio
	
				Information and Communication Technology
			
	Data di pubblicazione
	
				24-ott-2019
			
	Lingua
	
				Inglese
			
	Relatore, Supervisor, Advisor o Tutor
	
				Montresor, Alberto
			
	Correlatore, Controrelatore, Co-Supervisor,  Co-Tutor o Coordinatori
	
				Velegrakis, Ioannis
			
	Nome Editore
	
				Università degli studi di Trento
			
	Città Editore
	
				Trento
			
	Numero di pagine
	
				184
			
	Collezione di appartenenza
	
				Università degli Studi di Trento

File in questo prodotto:

File	Dimensione	Formato
main.pdf accesso aperto Dimensione 16.1 MB Formato Adobe PDF Visualizza/Apri	16.1 MB	Adobe PDF	Visualizza/Apri

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14242/102630

Il codice NBN di questa tesi è URN:NBN:IT:UNITN-102630