Classification Algorithms for Big Data over distributed processing frameworks

2016

Abstract

Classification problems have been widely studied in data mining, and many approaches to them have been developed in recent decades. Among these, associative classification and decision trees have proved very effective and have been successfully employed in several application domains. Furthermore, some of these approaches have integrated fuzzy set theory in order to deal with uncertain and noisy data. Unfortunately, most approaches proposed so far have been designed to maximize accuracy, often neglecting complexity in terms of both memory and execution time. As a result, they generally cannot adequately handle so-called "big data". In this Ph.D. thesis, we propose several solutions for generating accurate and interpretable classification models for big data in a distributed environment. In particular, we focus on associative classification and decision trees, integrating our solutions with fuzzy set theory. Since generating such models requires continuous features to be discretized, we also propose a novel distributed discretization approach based on information entropy. This approach has then been extended with fuzzy logic to generate fuzzy partitions. Finally, considering the complexity of the models generated by the previous solutions, we propose a distributed evolutionary approach for optimizing both the accuracy and the interpretability of the classifiers. The proposed algorithms are designed according to the MapReduce programming model and have been deployed on well-known data processing frameworks that are widely used in both research and industry. The performance evaluation has been carried out on several big data benchmarks, and the results obtained by the proposed approaches and by some state-of-the-art distributed classification algorithms are discussed extensively in terms of accuracy, model complexity, and computation time.
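As a rough illustration of the idea of entropy-based discretization organized in a map/reduce style, the following minimal Python sketch picks a single binary cut point for one continuous feature. It is not the algorithm proposed in the thesis: the function names (`map_partition`, `merge_counts`, `best_cut_point`), the in-memory "partitions", and the restriction to one cut point are all assumptions made for this example.

```python
from collections import Counter
from functools import reduce
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a class-count mapping."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)


def map_partition(rows):
    """Map step (hypothetical): class counts per feature value in one partition."""
    out = {}
    for value, label in rows:
        out.setdefault(value, Counter())[label] += 1
    return out


def merge_counts(a, b):
    """Reduce step (hypothetical): merge two value -> class-count dictionaries."""
    merged = {v: Counter(c) for v, c in a.items()}
    for v, c in b.items():
        merged.setdefault(v, Counter()).update(c)
    return merged


def best_cut_point(partitions):
    """Return (cut, weighted_entropy) of the best binary split, or None."""
    counts = reduce(merge_counts, map(map_partition, partitions), {})
    values = sorted(counts)
    total = sum((counts[v] for v in values), Counter())
    n = sum(total.values())
    left = Counter()
    best = None
    # Candidate cuts are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        left.update(counts[lo])
        right = total - left
        n_left = sum(left.values())
        score = (n_left / n) * entropy(left) + ((n - n_left) / n) * entropy(right)
        if best is None or score < best[1]:
            best = ((lo + hi) / 2, score)
    return best
```

For instance, calling `best_cut_point` on two partitions holding the values 1.0 and 2.0 (class `a`) and 3.0 and 4.0 (class `b`) selects the cut at 2.5, where both sides are pure. In a real distributed setting the map and reduce steps would run on a processing framework rather than on Python lists.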
18 May 2016
Italian
Marcelloni, Francesco
Ducange, Pietro
Bechini, Alessio
Università degli Studi di Pisa
Files in this item:
Segatori_PhD_Thesis.pdf

Open Access since 08/06/2019

Type: Other attached material
Size: 6.02 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/149160
The NBN code of this thesis is URN:NBN:IT:UNIPI-149160