Algorithms and techniques for data stream mining

2021

Abstract

The data stream abstraction covers a wide range of applications that generate data continuously and therefore require dedicated algorithms and approaches for exploitation and mining. In this framework, both unsupervised and supervised approaches are employed, depending on the task and on the availability of annotated data. This thesis proposes novel algorithms and techniques specifically tailored for the streaming setting and for knowledge discovery from social networks. In the first part of this work we propose a novel clustering algorithm for data streams. Our investigation starts from a discussion of the general challenges posed by cluster analysis and of those specific to the streaming setting. First, we propose SF-DBSCAN (streaming fuzzy DBSCAN), a preliminary solution conceived as an extension of the popular DBSCAN algorithm. SF-DBSCAN handles the arrival of new objects and continuously updates the clustering result by taking advantage of concepts from fuzzy set theory. However, it gives equal importance to every collected object and is therefore unsuitable for managing unbounded data streams and for adapting to evolving settings. Then, we introduce TSF-DBSCAN, a novel "temporal" adaptation of streaming fuzzy DBSCAN: it overcomes the limits of the previous proposal and proves effective in handling evolving and potentially unbounded data streams, discovering clusters with fuzzy overlapping borders. In the second part of the thesis we explore a supervised learning application: the goal of our analysis is to discover public opinion on vaccination in Italy, using the popular Twitter platform as a data source. First, we discuss the design and development of a system for stance detection from text. The deployment of the classification model for online monitoring of public opinion, however, cannot ignore that tweets can be seen as a particular form of temporal data stream. We therefore discuss the importance of leveraging user-related information, which enables the design of a set of techniques aimed at deepening and enhancing the analysis. Finally, we compare different learning schemes for addressing concept drift, i.e., a change in the underlying data distribution, in a dynamic environment affected by real-world, context-related events. In this case study and throughout the thesis, the proposed algorithms and techniques are supported by in-depth experimental analysis.
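The contrast drawn in the abstract between weighting every collected object equally and a "temporal" adaptation with fuzzy borders can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the linear membership between two hypothetical radii `eps_min` and `eps_max`, the exponential fading controlled by a hypothetical `half_life`, and all function names. It only shows how older objects might contribute less to a fuzzy neighbourhood density.

```python
def fuzzy_membership(distance, eps_min, eps_max):
    """Hypothetical fuzzy neighbourhood membership in [0, 1].

    Full membership within eps_min, none beyond eps_max,
    and a linear decrease in between (an assumed shape).
    """
    if distance <= eps_min:
        return 1.0
    if distance >= eps_max:
        return 0.0
    return (eps_max - distance) / (eps_max - eps_min)


def temporal_weight(age, half_life):
    """Hypothetical fading weight: halves every `half_life` time units."""
    return 0.5 ** (age / half_life)


def weighted_density(point, stream, now, eps_min, eps_max, half_life):
    """Sum of fuzzy memberships of stored objects, discounted by their age.

    `stream` is a list of (timestamp, value) pairs; 1-D values keep the
    sketch short, so the distance is a plain absolute difference.
    """
    total = 0.0
    for timestamp, other in stream:
        distance = abs(point - other)
        membership = fuzzy_membership(distance, eps_min, eps_max)
        total += membership * temporal_weight(now - timestamp, half_life)
    return total


if __name__ == "__main__":
    # One old object at the query point, one fresh object in the fuzzy border.
    stream = [(0, 0.0), (10, 1.5)]
    print(weighted_density(0.0, stream, now=10, eps_min=1.0, eps_max=2.0, half_life=10.0))
```

With equal weighting the old object would count fully; under the assumed fading it counts only half, so a density threshold would gradually stop treating stale regions as dense.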
English
Alessio Bechini
Università degli Studi di Firenze

Documents in UNITESI are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/132687
The NBN code of this thesis is URN:NBN:IT:UNIFI-132687