
Mining Predictive Models for Big Data Placement

2018

Abstract

The Compact Muon Solenoid (CMS) experiment at the European Organization for Nuclear Research (CERN) deploys its data collection, simulation and analysis activities on a distributed computing infrastructure involving more than 70 sites worldwide. Over the last few years, the historical usage data produced by this large infrastructure has been recorded on Big Data clusters featuring more than 5 Petabytes of raw storage, with different open-source user-level tools available for analytical purposes. The clusters offer a broad variety of computing and storage logs that represent a valuable, yet scarcely investigated, source of information for system tuning and capacity planning. Among these opportunities, the problem of understanding and predicting dataset popularity is of primary interest for CERN: solving it enables effective placement policies for the most frequently accessed datasets, resulting in remarkably shorter job latencies and increased system throughput and resource usage.

In this thesis, three key requirements for Petabyte-size dataset popularity models in a worldwide computing system such as the Worldwide LHC Computing Grid (WLCG) are investigated: the need for an efficient Hadoop data vault, serving as an effective mining platform capable of collecting computing logs from different monitoring services and organizing them into periodic snapshots; the need for a scalable pipeline of machine learning tools for training, on these snapshots, predictive models able to forecast which datasets will become popular over time, thus discovering patterns and correlations useful to enhance the overall efficiency of the distributed infrastructure; and the need for a novel caching policy, based on the dataset popularity predictions, that can outperform the current dataset replacement implementation.

The main contributions of this thesis include the following results:

1. We propose and implement a scalable machine learning pipeline, built on top of the CMS Hadoop data store, to predict the popularity of new and existing datasets accessed by jobs processing any of the 25 event types stored in the distributed CMS infrastructure. We cast the problem of predicting dataset accesses as a binary classification problem in which we forecast whether a given dataset will be popular at time slot t in a given CMS site s, i.e., whether it will be accessed more than x times during time slot t at site s. Our experiments show that the proposed predictive models achieve very satisfactory accuracy, indicating their ability to correctly separate popular datasets from unpopular ones.

2. We propose a novel intelligent data caching policy, named PPC (Popularity Prediction Caching). This caching strategy exploits the popularity predictions obtained with our best performing classifier to optimize the eviction policy implemented at each site of the CMS infrastructure. We assess the effectiveness of this policy by measuring the hit rates achieved by PPC and by caching baselines such as LRU (Least Recently Used) when managing the dataset access requests issued over a two-year timespan at 6 CMS sites. The results of our simulation show that PPC outperforms LRU, reducing the number of cache misses by up to 20% at some sites.
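To make the classification formulation concrete, the following Python fragment sketches how raw access records could be aggregated into binary training labels per (dataset, site, time slot). It is only an illustrative sketch: the column names, the weekly slot length, the threshold value of x, and the example site and dataset names are assumptions, not the pipeline implemented in the thesis.

    # Minimal sketch (not the thesis implementation): label dataset accesses
    # for binary popularity classification. Column names, the weekly time slot
    # and the threshold x are illustrative assumptions.
    import pandas as pd

    POPULARITY_THRESHOLD = 10  # assumed value of x: accesses per slot per site

    def label_popularity(access_log: pd.DataFrame) -> pd.DataFrame:
        """Aggregate raw access records into (dataset, site, slot) examples
        labelled popular (1) or unpopular (0)."""
        df = access_log.copy()
        # Discretize timestamps into weekly slots (slot length is an assumption).
        df["slot"] = pd.to_datetime(df["timestamp"]).dt.to_period("W")
        counts = (df.groupby(["dataset", "site", "slot"])
                    .size()
                    .reset_index(name="accesses"))
        counts["popular"] = (counts["accesses"] > POPULARITY_THRESHOLD).astype(int)
        return counts

    if __name__ == "__main__":
        # Hypothetical toy access log: 12 accesses to one dataset, 3 to another.
        log = pd.DataFrame({
            "dataset": ["/A/RAW"] * 12 + ["/B/AOD"] * 3,
            "site": ["T2_IT_Pisa"] * 15,
            "timestamp": ["2017-03-06"] * 12 + ["2017-03-07"] * 3,
        })
        print(label_popularity(log))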
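Similarly, the comparison between a popularity-driven eviction policy and LRU can be illustrated with a toy cache simulation. The snippet below is a hedged sketch under assumed inputs (a synthetic request trace, a per-dataset classifier score, and a capacity counted in datasets); it is not the simulator used for the experiments reported in the thesis.

    # Minimal sketch (assumed, not the thesis code): hit rates of an LRU cache
    # versus a PPC-like cache that evicts the dataset with the lowest
    # predicted popularity score.
    from collections import OrderedDict

    def simulate_lru(requests, capacity):
        cache, hits = OrderedDict(), 0
        for ds in requests:
            if ds in cache:
                hits += 1
                cache.move_to_end(ds)          # refresh recency on a hit
            else:
                if len(cache) >= capacity:
                    cache.popitem(last=False)  # evict the least recently used
                cache[ds] = True
        return hits / len(requests)

    def simulate_ppc(requests, capacity, predicted_popularity):
        """predicted_popularity: assumed dict mapping dataset -> classifier score."""
        cache, hits = set(), 0
        for ds in requests:
            if ds in cache:
                hits += 1
            else:
                if len(cache) >= capacity:
                    # Evict the cached dataset the model considers least popular.
                    victim = min(cache, key=lambda d: predicted_popularity.get(d, 0.0))
                    cache.remove(victim)
                cache.add(ds)
        return hits / len(requests)

    if __name__ == "__main__":
        requests = ["A", "B", "A", "C", "A", "D", "A", "B", "E", "A"]
        scores = {"A": 0.9, "B": 0.6, "C": 0.1, "D": 0.1, "E": 0.2}
        print("LRU hit rate:", simulate_lru(requests, capacity=3))
        print("PPC hit rate:", simulate_ppc(requests, capacity=3, predicted_popularity=scores))

On this toy trace the popularity-aware policy keeps the heavily requested dataset cached and reaches a higher hit rate than LRU, which is the effect the thesis quantifies at scale.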
17-Oct-2018
Italian
Perego, Raffaele
Tonellotto, Nicola
Università degli Studi di Pisa
Files in this item:
File: marcomeoni.pdf (open access)
Type: Other attached material
Format: Adobe PDF
Size: 3.57 MB

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/132788
The NBN code of this thesis is URN:NBN:IT:UNIPI-132788