Rigorous and Scalable Techniques for Mining Patterns from Sequential Data
TONON, ANDREA
2022
Abstract
Massive amounts of data are generated continuously in many fields, such as finance, social networks, and medicine. Data mining is the task of discovering interesting patterns by exploring large amounts of data, making them useful and understandable. In recent years, the data mining community has proposed and tackled a vast set of problems, but many others remain open. Now more than ever, it is of crucial importance to analyze and extract reliable knowledge from massive datasets. This fundamental task, however, poses several challenges. The first is to design algorithms whose computation scales to the analysis of massive datasets; in such a scenario, approaches that return rigorous, high-quality approximations are often the only viable option. The second is to develop strategies that extract useful knowledge with statistical guarantees on the analysis, while filtering out spurious discoveries. Finally, the abundance of data opens new scenarios, with many sources generating data continuously and requiring analyses that take into account the sequential nature of the data. The objective of this Thesis is to design novel scalable and rigorous techniques to mine patterns from sequential data, in three scenarios.

The first scenario we consider is mining frequent sequential patterns through sampling. Sequential pattern mining is a fundamental task in data mining and knowledge discovery, with applications in several areas (e.g., biology); it has been extensively studied in the literature, and several exact methods have been defined. However, on large modern datasets, exact methods are computationally very demanding. In this direction, we develop an algorithm that uses sampling to mine rigorous approximations of the frequent sequential patterns, with guarantees defined in terms of false positives or false negatives.

The second scenario we consider is mining patterns from samples drawn from unknown probability distributions. In many real-life applications (e.g., market basket analysis), a dataset is analyzed to gain insight into the underlying process that generates the data. However, by analyzing only a sample, one cannot solve the problem exactly and has to resort to approximations. In this setting, we tackle two problems: mining, from a single dataset, true frequent sequential patterns, i.e., the sequential patterns frequently generated by the underlying generative process; and mining statistically robust patterns from a sequence of datasets, i.e., patterns whose probabilities of being generated by the processes behind the datasets follow well-specified trends. For both problems, we develop novel algorithms that return rigorous approximations, defined in terms of false positives or false negatives.

The last scenario we consider is mining significant patterns. In significant pattern mining, the dataset is seen as a sample from an unknown probability distribution, and the aim is to extract patterns that deviate significantly from an assumed null hypothesis, with rigorous guarantees on the statistical significance of the output and control over the retrieval of false discoveries. In this setting, we tackle two problems: mining statistically significant sequential patterns, and mining statistically significant paths in time series data from an unknown network. For both problems, we develop novel algorithms that provide rigorous guarantees in terms of false discoveries, employing the statistical hypothesis testing framework and techniques based on permutation testing.
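To give a concrete feel for the sampling idea of the first scenario, the following is a minimal, illustrative Python sketch, not the algorithm developed in the Thesis: it mines patterns from a random sample of the dataset and lowers the frequency threshold by a slack eps/2, the usual device for trading a few false positives for, with high probability, no false negatives. The toy pattern language, the sample size, and the value of eps are placeholder assumptions; the Thesis derives the sample size and the slack rigorously.

```python
# Minimal illustrative sketch (assumptions, not the Thesis's algorithm):
# approximate the theta-frequent patterns of a sequential dataset from a
# random sample, using a lowered threshold theta - eps/2 to limit false
# negatives at the price of some false positives.
import random
from collections import Counter
from itertools import combinations

def subsequences(seq, max_len=2):
    """All ordered subsequences of seq up to length max_len (toy pattern language)."""
    pats = set()
    for k in range(1, max_len + 1):
        for idxs in combinations(range(len(seq)), k):
            pats.add(tuple(seq[i] for i in idxs))
    return pats

def approx_frequent_patterns(dataset, theta, eps, sample_size, seed=0):
    """Patterns whose frequency in a random sample is at least theta - eps/2."""
    random.seed(seed)
    sample = random.choices(dataset, k=sample_size)   # sample with replacement
    counts = Counter()
    for seq in sample:
        for pat in subsequences(seq):                 # each pattern counted once per sequence
            counts[pat] += 1
    return {pat: c / sample_size for pat, c in counts.items()
            if c / sample_size >= theta - eps / 2}

# Toy usage: sequences of item ids, target frequency threshold theta = 0.4.
data = [[1, 2, 3], [1, 3], [2, 3, 1], [1, 2], [3, 1, 2]]
print(approx_frequent_patterns(data, theta=0.4, eps=0.1, sample_size=4))
```

At a high level, the same corrected-threshold idea, applied to the frequencies observed in the single available dataset, also underlies the true frequent sequential pattern problem of the second scenario.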
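For the significant pattern mining scenario, the sketch below illustrates the basic permutation-testing idea in an assumed labeled-transactions setting: class labels are shuffled to simulate the null hypothesis of no association between a pattern and the label, and the p-value is the fraction of permutations whose statistic is at least as extreme as the observed one. The multiple-hypothesis corrections that yield the Thesis's guarantees on false discoveries are not shown.

```python
# Minimal sketch of a permutation test for one pattern (assumed setup, not the
# Thesis's procedure): the class labels are permuted to simulate the null
# hypothesis that the pattern and the label are independent.
import random

def support_in_class(pattern, transactions, labels, target=1):
    """Number of transactions with label == target that contain the pattern."""
    return sum(1 for t, y in zip(transactions, labels)
               if y == target and pattern.issubset(t))

def permutation_pvalue(pattern, transactions, labels, n_perm=1000, seed=0):
    """Fraction of label permutations whose class-conditional support is at
    least the observed one (with the standard add-one correction)."""
    rng = random.Random(seed)
    observed = support_in_class(pattern, transactions, labels)
    perm = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if support_in_class(pattern, transactions, perm) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

# Toy usage: is the itemset {"a"} over-represented among class-1 transactions?
transactions = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "c"}, {"b"}, {"a", "b"}]
labels = [1, 1, 0, 1, 0, 1]
print(permutation_pvalue(frozenset({"a"}), transactions, labels, n_perm=500))
```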
File: tesi_definitiva_Andrea_Tonon.pdf (open access, 2.92 MB, Adobe PDF)
https://hdl.handle.net/20.500.14242/93257
URN:NBN:IT:UNIPD-93257