Big data and the web: algorithms for data intensive scalable computing

2012

Abstract

This thesis explores the problem of large-scale Web mining using Data Intensive Scalable Computing (DISC) systems. Web mining aims to extract useful information and models from data on the Web, the largest repository ever created. DISC systems are an emerging technology for processing huge datasets in parallel on large computer clusters. Challenges arise from both research themes. The Web is heterogeneous: data lives in various formats that are best modeled in different ways, so extracting information effectively requires careful design of algorithms for specific categories of data. The Web is also huge, and DISC systems offer a platform for building scalable solutions. However, they provide restricted computing primitives for the sake of performance, so efficiently harnessing the parallelism they offer involves rethinking traditional algorithms. This thesis tackles three classical problems in Web mining. First, we propose a novel solution to finding similar items in a bag of Web pages. Second, we consider how to effectively distribute content from Web 2.0 to users via graph matching. Third, we show how to harness the streams from the real-time Web to suggest news articles. Our main contribution lies in rethinking these problems in the context of massive-scale Web mining, and in designing efficient MapReduce and streaming algorithms to solve them on DISC systems.
English
QA75 Electronic computers. Computer science
Lucchese, Dr. Claudio
Scuola IMT Alti Studi di Lucca
Files in this item:

File: De%20Francisci_phdthesis.pdf
Access: open access
Type: Other attached material
Size: 2.27 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/144197
The NBN code of this thesis is URN:NBN:IT:IMTLUCCA-144197