Cyber security involves protecting information and systems from major cyber threats; frequently, some high-level techniques, such as for instance data mining techniques, are be used to efficiently fight, alleviate the effect or to prevent the action of the cybercriminals. In particular, classification can be efficiently used for many cyber security application, i.e. in intrusion detection systems, in the analysis of the user behavior, risk and attack analysis, etc. However, the complexity and the diversity of modern systems opened a wide range of new issues difficult to address. In fact, security softwares have to deal with missing data, privacy limitation and heterogeneous sources. Therefore, it would be really unlikely a single classification algorithm will perform well for all the types of data, especially in presence of changes and with constraints of real time and scalability. To this aim, this thesis proposes a framework based on the ensemble paradigm to cope with these problems. Ensemble is a learning paradigm where multiple learners are trained for the same task by a learning algorithm, and the predictions of the learners are combined for dealing with new unseen instances. The ensemble method helps to reduce the variance of the error, the bias, and the dependence from a single dataset; furthermore, it can be build in an incremental way and it is apt to distributed implementations. It is also particularly suitable for distributed intrusion detection, because it permits to build a network profile by combining different classifiers that together provide complementary information. However, the phase of building of the ensemble could be computationally expensive as when new data arrives, it is necessary to restart the training phase. For this reason, the framework is based on Genetic Programming to evolve a function for combining the classifiers composing the ensemble, having some attractive characteristics. First, the models composing the ensemble can be trained only on a portion of the training set, and then they can be combined and used without any extra phase of training. Moreover the models can be specialized for a single class and they can be designed to handle the difficult problems of unbalanced classes and missing data. In case of changes in the data, the function can be recomputed in an incrementally way, with a moderate computational effort and, in a streaming environment, drift strategies can be used to update the models. In addition, all the phases of the algorithm are distributed and can exploits the advantages of running on parallel/ distributed architectures to cope with real time constraints. The framework is oriented and specialized towards cyber security applications. For this reason, the algorithm is designed to work with missing data, unbalanced classes, models specialized on some tasks and model working with streaming data. Two typical scenarios in the cyber security domain are provided and some experiment are conducted on artificial and real datasets to test the effectiveness of the approach. The first scenario deals with user behavior. The actions taken by users could lead to data breaches and the damages could have a very high cost. The second scenario deals with intrusion detection system. In this research area, the ensemble paradigm is a very new technique and the researcher must completely understand the advantages of this solution.

Ensemble learning techniques for cyber security applications

2017

Abstract

Cyber security involves protecting information and systems from major cyber threats; frequently, some high-level techniques, such as for instance data mining techniques, are be used to efficiently fight, alleviate the effect or to prevent the action of the cybercriminals. In particular, classification can be efficiently used for many cyber security application, i.e. in intrusion detection systems, in the analysis of the user behavior, risk and attack analysis, etc. However, the complexity and the diversity of modern systems opened a wide range of new issues difficult to address. In fact, security softwares have to deal with missing data, privacy limitation and heterogeneous sources. Therefore, it would be really unlikely a single classification algorithm will perform well for all the types of data, especially in presence of changes and with constraints of real time and scalability. To this aim, this thesis proposes a framework based on the ensemble paradigm to cope with these problems. Ensemble is a learning paradigm where multiple learners are trained for the same task by a learning algorithm, and the predictions of the learners are combined for dealing with new unseen instances. The ensemble method helps to reduce the variance of the error, the bias, and the dependence from a single dataset; furthermore, it can be build in an incremental way and it is apt to distributed implementations. It is also particularly suitable for distributed intrusion detection, because it permits to build a network profile by combining different classifiers that together provide complementary information. However, the phase of building of the ensemble could be computationally expensive as when new data arrives, it is necessary to restart the training phase. For this reason, the framework is based on Genetic Programming to evolve a function for combining the classifiers composing the ensemble, having some attractive characteristics. First, the models composing the ensemble can be trained only on a portion of the training set, and then they can be combined and used without any extra phase of training. Moreover the models can be specialized for a single class and they can be designed to handle the difficult problems of unbalanced classes and missing data. In case of changes in the data, the function can be recomputed in an incrementally way, with a moderate computational effort and, in a streaming environment, drift strategies can be used to update the models. In addition, all the phases of the algorithm are distributed and can exploits the advantages of running on parallel/ distributed architectures to cope with real time constraints. The framework is oriented and specialized towards cyber security applications. For this reason, the algorithm is designed to work with missing data, unbalanced classes, models specialized on some tasks and model working with streaming data. Two typical scenarios in the cyber security domain are provided and some experiment are conducted on artificial and real datasets to test the effectiveness of the approach. The first scenario deals with user behavior. The actions taken by users could lead to data breaches and the damages could have a very high cost. The second scenario deals with intrusion detection system. In this research area, the ensemble paradigm is a very new technique and the researcher must completely understand the advantages of this solution.
13-lug-2017
Inglese
Computer security
Machine learning
Crupi, Felice
Folino, Gianluigi
Università della Calabria
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14242/149902
Il codice NBN di questa tesi è URN:NBN:IT:UNICAL-149902