Monitoring, Detecting, Identifying, and Healing Anomalous Workload in Clustered Computing Environments

Elgazazz, Areeg

Context and Motivations. Container-based architecture is emerging as a new approach for building distributed applications as collections of independent services that work together. As a result, applications can be scaled and updated based on the load attributed to each individual container. Monitoring the workload in a distributed system is a complex task, because performance degradation within a single container can cascade and reduce the performance of other dependent containers. Such degradation may result from anomalous workload, which can be observed as an excessive application response time that would be considered a failure. Hence, knowing workload characteristics in advance helps in controlling system resources and thereby improving system performance. Workload prediction can be used to decide the amount of resources to allocate to each container or node in the future. The accuracy of workload prediction varies depending on the prediction method used and the characteristics of the workload. Furthermore, the heterogeneity of resources offered in the cloud may also lead to workload variations that affect the performance of the overall system, because some workloads are CPU intensive whereas others are memory intensive. Such variations may lead to violations of the service level agreements (SLAs) made between service providers and users to specify the quality of the provided services. Because of the high complexity of cloud applications, modeling application behavior usually requires domain knowledge that is difficult to obtain. In such cases, anomaly detection, prediction, and localization can help in capturing and tracking behavior that deviates from the normal behavior. We aim to investigate how to analyze anomalous resource behavior in a cluster of nodes whose load consists of applications deployed in containers, using a sequence of observations emitted by the resources.

Objective. The objective of the thesis is to provide a self-adaptive architecture that detects, locates, and heals anomalous behavior in a containerized cluster environment based on the observed response time.

Method. We propose a self-adaptive architecture that comprises two models: the Fault Management Models and the Recovery Model. The Fault Management Models apply an anomaly type identification mechanism based on the detected anomaly. The Recovery Model provides multiple recovery actions to be applied based on the type of the identified anomaly. Finally, the proposed architecture is evaluated to assess its accuracy in detecting, identifying, predicting, and recovering from anomalous behavior within system components.

Results. Different experiments are conducted to show the performance of the proposed architecture. The experiments show that the proposed architecture can detect and locate anomalous behavior with an accuracy of more than 97%.

Thesis Statement. Analyzing anomalous behavior in a containerized cluster environment, and providing multiple recovery actions for the analyzed anomaly, not only increases system scalability but also reduces operating cost and system maintenance.

2020
