A cluster computing approach applied to machine learning for earth observation big data analysis

IANNITTO, GIUSEPPE
2020

Abstract

Machine learning and deep learning techniques are becoming increasingly popular as a promising approach to the analysis of large business and scientific data sets, enabling more sophisticated models for intelligent applications such as image recognition, speech categorization, and automatic machine translation. Achieving highly accurate data models requires massive amounts of input training data; however, most current machine learning tools still try to scale up both storage and processing on a single machine in order to handle such large data volumes and computation-intensive models. The execution time of the training phase of an accurate and efficient model can become a bottleneck as the data size grows, since the enormous computation and data input remain a heavy burden for a single machine. Research scientists must set up the network with an initial configuration and then wait a long time to get back the trained model, which slows down the model training process. To address this problem, some frameworks and solutions use accelerators such as GPUs [18] and FPGAs [19], offloading certain operations onto accelerator cores. Unfortunately, a single system can only scale up to a certain extent, which limits the applicability of such frameworks. On the other hand, frameworks such as Spark [20] distribute computing tasks across the nodes of a cluster. To find a possible solution for reducing the execution time of the training phase of an accurate and efficient model in machine learning environments, this work proposes a distributed machine-learning framework based on Apache Spark and a cluster of distributed CPUs (no GPUs), building implementations of Neural Network computational models to demonstrate that this scalable architecture provides a training speedup that increases uniformly with the number of worker nodes. The key contributions of this approach are:

• Leveraging the distributed features of Spark, this library can scale out both the dataset and the computation by distributing data and operations across the cluster nodes. The library targets a Multilayer Perceptron Neural Network computational model for automatic pixel-based image classification, aimed at cloud detection from LANDSAT 8 satellite images and at volcanic ash mass retrieval from MODIS data. The generation of training data, the training of the network, and the classification of an image are implemented by applying parallel programming primitives to multidimensional arrays of data distributed across the nodes of a computer cluster (a minimal training sketch is given after this list).

• In addition to a complete description of the Multilayer Perceptron Neural Network model implementations, this work provides a computational benchmark comparing different machine learning frameworks (Spark MLlib, TensorFlow and BigDL) and evaluates the scalability of the Spark cluster using the speedup metric (defined after this list). The results show that the Spark-based system achieves near-linear scalability in its distributed configuration for the tested dataset: by adding two or three more nodes to the system, the running time can be reduced by a factor of 2-3 relative to the original stand-alone single-node configuration.

• Parameter synchronization is a performance-critical operation for data-parallel distributed model training, in terms of both speed and scalability (a conceptual sketch is given after this list). BigDL, a distributed machine learning framework for big data platforms and workflows, has been used as a library on top of Apache Spark because, unlike existing machine learning frameworks, it has been shown to train large neural networks efficiently across large clusters (e.g., hundreds of servers). BigDL allows users to build machine learning applications for Earth Observation big data using a single unified data pipeline; the entire pipeline can run directly on top of existing big data systems such as Apache Spark in a distributed fashion.

• The results demonstrate that the use of a unified data analytics and machine learning system such as BigDL over Spark improves ease of use (including development, deployment and operations) and enhances the performance of the machine learning process in the Earth Observation big data context, where achieving highly accurate data models requires massive amounts of input training data.
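As an illustration of the kind of distributed pipeline described above, the following minimal sketch (not the thesis code) shows how a Multilayer Perceptron could be defined and trained on a Spark RDD with the BigDL 0.x Python API; the layer sizes, learning rate, and the synthetic training RDD are illustrative assumptions.

import numpy as np
from pyspark import SparkContext
from bigdl.util.common import create_spark_conf, init_engine, Sample
from bigdl.nn.layer import Sequential, Linear, ReLU, LogSoftMax
from bigdl.nn.criterion import ClassNLLCriterion
from bigdl.optim.optimizer import Optimizer, SGD, MaxEpoch

sc = SparkContext(conf=create_spark_conf())  # Spark drives the distribution
init_engine()                                # initialize BigDL on driver and executors

# Hypothetical MLP: 7 per-pixel input features, one hidden layer,
# 2 output classes (e.g., cloud / no cloud); all sizes are illustrative.
model = Sequential()
model.add(Linear(7, 16)).add(ReLU())
model.add(Linear(16, 2)).add(LogSoftMax())

# Tiny synthetic stand-in for per-pixel training samples extracted from
# satellite imagery; note that ClassNLLCriterion expects 1-based labels.
train_rdd = sc.parallelize(range(1000)).map(
    lambda i: Sample.from_ndarray(np.random.rand(7),
                                  np.array([float(i % 2) + 1.0])))

optimizer = Optimizer(model=model,
                      training_rdd=train_rdd,
                      criterion=ClassNLLCriterion(),
                      optim_method=SGD(learningrate=0.01),
                      end_trigger=MaxEpoch(20),
                      batch_size=128)
trained_model = optimizer.optimize()  # data-parallel training on the executors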
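The speedup metric used for the scalability evaluation is the standard one (the notation below is ours, not necessarily the thesis's): if T_1 is the training time on a single node and T_n the time on n worker nodes, then

\[
  S(n) = \frac{T_1}{T_n}
\]

Near-linear scalability means S(n) grows roughly like n. With purely illustrative figures, T_1 = 60 minutes on one node and T_4 = 25 minutes after adding three more nodes give S(4) = 2.4, consistent with the reported 2-3x reduction in running time.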
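The role of parameter synchronization can be pictured with the conceptual sketch below of synchronous data-parallel SGD (plain numpy, all names hypothetical). This is a deliberate simplification: BigDL does not average gradients on a central driver but aggregates them with an AllReduce-style scheme built on Spark primitives; the sketch only shows why the aggregation step sits on the critical path of every iteration.

import numpy as np

def local_gradient(weights, shard):
    # Hypothetical per-worker gradient on its local data shard
    # (in Spark, one task per partition would compute this).
    X, y = shard
    return X.T @ (X @ weights - y) / len(y)  # least-squares gradient, for illustration

def synchronous_step(weights, shards, lr=0.1):
    # 1. Every worker computes a gradient on its own shard in parallel.
    grads = [local_gradient(weights, s) for s in shards]
    # 2. Gradients are aggregated across workers: the communication-heavy
    #    step that makes synchronization performance-critical.
    avg_grad = np.mean(grads, axis=0)
    # 3. The same update is applied everywhere, keeping all replicas in sync.
    return weights - lr * avg_grad

# Toy usage: 4 "workers", each holding a shard of a linear problem.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
shards = [(X, X @ true_w) for X in (rng.normal(size=(100, 2)) for _ in range(4))]
w = np.zeros(2)
for _ in range(200):
    w = synchronous_step(w, shards)
print(w)  # converges toward true_w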
Language: English
Supervisor: DEL FRATE, FABIO
Università degli Studi di Roma "Tor Vergata"
Files in this item:
PhD_Dissertation_Iannitto.pdf (2.28 MB, Adobe PDF): access only from BNCF and BNCR
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/214361
The NBN code of this thesis is URN:NBN:IT:UNIROMA2-214361