Model-based auto-scaling of distributed data stream processing applications
RUSSO RUSSO, GABRIELE
2020
Abstract
The ubiquitous presence of smart devices at the edge of the network (e.g., wearable devices, smartphones, Internet-of-Things sensors) fosters an unending growth in the amount of data collected every day. As data volume grows, so does the number of data-driven services and applications, whose functionality depends on the ability to extract valuable information from raw data, usually in near real-time. In this context, distributed Data Stream Processing (DSP) applications play a key role, enabling the analysis of fast data streams with very low latency. The core idea behind DSP applications is processing data as soon as they become available (i.e., without storing them first). For this purpose, data are streamed through a network of so-called operators, which apply specific transformations (e.g., filtering) or computations (e.g., pattern matching) to the input data. By executing operators over multiple distributed nodes, DSP applications can cope with high-volume data streams.

Meeting application performance requirements (e.g., maximum processing latency) in modern, highly distributed computing environments poses several challenges. First, applications usually face variable, unpredictable workloads, which call for dynamic resource allocation mechanisms that sustain the incoming load while avoiding resource wastage. Furthermore, to reduce processing latency, there is an increasing interest in deploying DSP applications (or parts of them) closer to the data sources, on Fog/Edge computing platforms. However, when moved out of traditional clusters and Cloud data centers, applications have to face new issues, including resource heterogeneity, limited and possibly unstable network bandwidth, and reduced processing capacity. These aspects must necessarily be taken into account at run-time to make optimal use of the available resources. To deal with the aforementioned challenges, several mechanisms have been investigated to make applications self-adaptive and able to respond to changes in their working conditions. Among them, operator auto-scaling has been identified as an essential feature for DSP systems. Nevertheless, to properly drive application scaling in modern environments, new control architectures and algorithms are needed to adapt application deployment efficiently and effectively.

In this thesis, we study the issues associated with auto-scaling DSP applications in highly distributed, heterogeneous platforms. Observing that most existing adaptation solutions rely on centralized controllers, which can suffer from scalability issues in Fog-like environments, we design the Elastic and Distributed DSP Framework (EDF), a control architecture for decentralized self-adaptation. EDF relies on a two-layered hierarchy of controllers, with separation of concerns and time scales among the layers. In particular, controllers at the top of the hierarchy supervise the adaptation of whole DSP applications, coordinating subordinate per-operator managers. We integrate EDF into open-source DSP frameworks, namely Apache Storm and Apache Flink. As a second contribution, we devise auto-scaling policies to be integrated into EDF and tackle three issues that have often been neglected in the literature: adaptation overhead, model uncertainty, and resource heterogeneity. Specifically, we formulate the auto-scaling problem as a multi-objective Markov Decision Process, where the often significant penalty associated with scaling actions is considered along with terms related to performance and deployment costs.
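To make the multi-objective formulation concrete, a minimal sketch of such a per-step cost is shown below; the weights and term names are illustrative assumptions, not necessarily the exact cost function adopted in the thesis.

```latex
% Illustrative per-step cost of the auto-scaling MDP (hypothetical weights and term names)
c(s, a) = w_{\mathrm{perf}}\, c_{\mathrm{perf}}(s, a)
        + w_{\mathrm{res}}\, c_{\mathrm{res}}(s, a)
        + w_{\mathrm{adp}}\, c_{\mathrm{adp}}(s, a),
\qquad w_{\mathrm{perf}} + w_{\mathrm{res}} + w_{\mathrm{adp}} = 1
```

Here the performance term penalizes violations of the application requirement (e.g., maximum processing latency), the resource term accounts for deployment cost, and the adaptation term charges the reconfiguration penalty of scaling actions; the objective is to minimize the expected discounted cumulative cost.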
Assuming that complete knowledge of the system model, including workload dynamics and operator performance models, is not available, we resort to reinforcement learning (RL) methods to derive auto-scaling policies at run-time. In particular, we adopt model-based RL algorithms, which speed up agent training by incorporating partially available knowledge about the system. Then, we extend our solution to deal with heterogeneous computing infrastructures, where not only the parallelism level of each operator must be optimized, but also the type of computing nodes to use. As the state space of the resulting model becomes difficult to handle, we integrate function approximation (FA) techniques into our RL algorithms. Working on a reduced view of the state space, FA-based algorithms can learn near-optimal auto-scaling policies in a scalable manner.

As a final contribution, we consider the additional challenge of workload burstiness, which makes models based on a simple characterization of the operator load (e.g., the average input data rate) highly inaccurate and hence unable to drive auto-scaling. Therefore, we propose an auto-scaling framework that, exploiting Markovian Arrival Processes for online workload characterization, can guarantee the satisfaction of response time requirements in the face of bursty inputs. Moreover, to overcome the significant overhead of parallelism modifications, we rely on vertical operator scaling as the key adaptation mechanism in this framework, managing to quickly adapt operator resource allocation as needed.

The evaluation of our policies, both through simulation and real experiments on top of Storm and Flink, demonstrates the benefits of combining modeling techniques and online learning. While models remain the fundamental basis of performance-oriented optimization, learning methods allow us to cope with the uncertainty about workload characteristics and application performance dynamics that is inevitably encountered in real-world scenarios.
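As a rough illustration of how a model-based RL agent of this kind can operate, the sketch below learns only the workload transition probabilities from observations and exploits a known cost structure in full Bellman backups. It is a toy example under simplifying assumptions (discretized load levels, a trivial capacity model, hypothetical cost weights), not the implementation used in the thesis.

```python
# Illustrative sketch (not the thesis implementation): a model-based RL agent
# for operator auto-scaling. The workload transition probabilities are
# estimated online, while the cost structure (SLO-violation, resource and
# adaptation penalties) is assumed known and used in full Bellman backups.
import numpy as np

K_MAX, LOAD_LEVELS = 10, 8               # max parallelism, discretized load levels
ACTIONS = (-1, 0, +1)                    # scale in, keep, scale out
GAMMA = 0.99                             # discount factor
W_SLO, W_RES, W_ADAPT = 1.0, 0.2, 0.5    # hypothetical cost weights

# Q-table and (Laplace-smoothed) empirical workload transition counts
Q = np.zeros((K_MAX + 1, LOAD_LEVELS, len(ACTIONS)))
load_counts = np.ones((LOAD_LEVELS, LOAD_LEVELS))

def cost(k, load, action):
    """Known (assumed) immediate cost: SLO violation + resource usage + reconfiguration."""
    slo_violation = 1.0 if load > k else 0.0        # toy capacity model
    return W_SLO * slo_violation + W_RES * k / K_MAX + W_ADAPT * (action != 0)

def update(k, load, next_load):
    """Model-based backup: refresh the workload model, then back up all actions."""
    load_counts[load, next_load] += 1
    p_next = load_counts[load] / load_counts[load].sum()   # estimated P(load' | load)
    for a_idx, a in enumerate(ACTIONS):
        k_next = int(np.clip(k + a, 1, K_MAX))
        expected_future = p_next @ Q[k_next].min(axis=1)   # expectation over next load
        Q[k, load, a_idx] = cost(k, load, a) + GAMMA * expected_future

def best_action(k, load):
    """Greedy (cost-minimizing) scaling decision for the current state."""
    return ACTIONS[int(Q[k, load].argmin())]
```

The design point this illustrates is the one stated above: only the workload dynamics are learned from observations, while the known cost structure is exploited directly in the backups, which is what allows model-based agents to train faster than purely model-free ones.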
| File | Size | Format |
|---|---|---|
| thesis_RussoRusso.pdf (access only from BNCF and BNCR) | 5.11 MB | Adobe PDF |
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/295621
URN:NBN:IT:UNIROMA2-295621