Big data and the web: algorithms for data intensive scalable computing

De Francisci Morales, Gianmarco

This thesis explores the problem of large-scale Web mining by using Data Intensive Scalable Computing (DISC) systems. Web mining aims to extract useful information and models from data on the Web, the largest repository ever created. DISC systems are an emerging technology for processing huge datasets in parallel on large computer clusters. Challenges arise from both themes of research. The Web is heterogeneous: data lives in various formats that are best modeled in different ways. Effectively extracting information requires careful design of algorithms for specific categories of data. The Web is huge, but DISC systems offer a platform for building scalable solutions. However, they provide restricted computing primitives for the sake of performance. Efficiently harnessing the power of parallelism offered by DISC systems involves rethinking traditional algorithms. This thesis tackles three classical problems in Web mining. First, we propose a novel solution to finding similar items in a bag of Web pages. Second, we consider how to effectively distribute content from Web 2.0 to users via graph matching. Third, we show how to harness the streams from the real-time Web to suggest news articles. Our main contribution lies in rethinking these problems in the context of massive-scale Web mining, and in designing efficient MapReduce and streaming algorithms to solve them on DISC systems.

2012


Publication date: 2012
Language: English
Keyword: QA75 Electronic computers. Computer science
Supervisor, Advisor, or Tutor: Lucchese, Dr. Claudio
Publisher: Scuola IMT Alti Studi di Lucca
Collection: Scuola IMT Alti Studi di Lucca

Files in this item:

File	Size	Format
De%20Francisci_phdthesis.pdf (open access; Type: Other attached material; License: All rights reserved)	2.27 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/144197

The NBN code of this thesis is URN:NBN:IT:IMTLUCCA-144197