
Neural network embedding: representation learning and latent knowledge extraction for data mining applications

2021

Abstract

Big data provides machine learning models with what they need to learn concepts and tasks with adequate generalization margins. It is also valuable for data mining applications whose goal is to extract latent and valuable knowledge. Much of the data we can collect every day is unstructured: no predefined schema is present. Some examples are documents, messages, and interactions on social media platforms and e-commerce web sites. The unstructured nature of these data inevitably adds further challenges to the cases reported above. Machine learning models need input data as real-valued vectors (a feature or design matrix) for both supervised and unsupervised learning. The same issue arises when we are interested in quantitative information, for example the similarity between two or more documents, or any other statistic. Another interesting case concerns network science techniques for analyzing data that have (or suggest) a graph structure. Several systems can be efficiently described as nodes that interact with each other (e.g., social networks, recommendation systems, interactions among proteins, financial markets). In this case, we cannot apply machine learning techniques directly because of the lack of a vector representation. Advances in the field of neural networks make it possible to learn feature vectors directly from the input data distribution. This task can be seen as a pre-processing step in a modern machine learning project, and it involves machine learning itself. This automatic feature extraction is also known as "representation learning". We can summarize it as learning a vector representation of input data in a supervised or unsupervised way. We can learn these representations, or embeddings, for different kinds of data: words, entire documents, nodes or edges in a graph, images, and signals.
Moreover, embeddings encode data while preserving and even adding information: for example, semantically similar words will be close in their vector representations. The choice of how to achieve this preliminary task influences all subsequent stages of a typical data-mining pipeline. The challenge consists in finding a way to preserve as much information as possible while looking for a structure. Second, given the promising results achieved recently, it is interesting to exploit the resulting vector spaces for knowledge extraction as well. In light of the current challenges and advances in the field of representation learning with unstructured data, my research activity has focused on this topic. In particular, this thesis reports the results achieved in two main directions:

• Using representation learning techniques based on neural networks to extract latent knowledge from documents: if neural networks can capture fundamental aspects of data by learning a different representation in the output layer, and considering that this representation makes classification or clustering easier, can we exploit these techniques to extract new knowledge? Part of my research tries to answer this question. In particular, two use cases are reported: the first analyzes scientific documents from the public repository Scopus, combining word embeddings and human mobility metrics; the second uses neural network embeddings to extract knowledge from the financial reports of thousands of American companies in the stock market.

• Overcoming the current limits of representation learning on graphs: machine learning models can benefit from input data derived from graph structures. However, to apply most of the available models it is necessary to obtain a vector form for nodes and edges. The most promising approach is neural network embedding, and the state of the art is represented by the Node2Vec algorithm.
However, two problems remain open in this field: scalability (learning a representation in large-scale graphs) and the lack of support for dynamic contexts: if a new node joins the network, the representation of the entire graph must be computed again. Part of my doctorate addresses these two problems. A first contribution is an actor-based version of Node2Vec that overcomes the scalability issues by distributing the bottlenecks among agents that organize themselves with different behaviors to achieve the embedding in large-scale graphs. A second contribution is the development of a novel algorithm for incremental feature learning over graphs. The algorithm exploits properties of scale-free graphs to encode new nodes without retraining the model over all the nodes. It computes a light embedding over the 20% of nodes with the highest degree, and then performs a supervised alignment by solving the orthogonal Procrustes problem.
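The alignment step of the second contribution can be illustrated with a minimal sketch: given the original embedding of a set of shared anchor nodes and a freshly re-trained light embedding of the same nodes, find the orthogonal matrix that maps the new space onto the old one. This sketch uses `scipy.linalg.orthogonal_procrustes` on synthetic data; the variable names and dimensions are illustrative assumptions, not the thesis implementation.

```python
# Hypothetical sketch of embedding alignment via orthogonal Procrustes.
# Z_old: anchor-node vectors in the original space; Z_new: the same nodes
# in a re-trained space (here simulated as a random orthogonal transform).
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)

d = 8          # embedding dimension (illustrative)
n_anchor = 50  # high-degree anchor nodes shared by both embeddings

# Anchor embeddings in the original space.
Z_old = rng.normal(size=(n_anchor, d))

# Simulate the re-trained light embedding: same geometry, rotated.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Z_new = Z_old @ Q

# Solve min_R ||Z_new @ R - Z_old||_F subject to R^T R = I.
R, _ = orthogonal_procrustes(Z_new, Z_old)

aligned = Z_new @ R
print(np.allclose(aligned, Z_old, atol=1e-8))  # → True
```

Because the solution is constrained to be orthogonal, the alignment preserves distances and angles within the new embedding: only its orientation changes to match the anchor space.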
English
word embedding
neural networks
machine learning
representation learning
graph embedding
knowledge discovery
Poggi, Agostino
Pardalos, Panos M.
Università degli Studi di Parma
Files in this item:
File Size Format
Tesi_PHD_Gianfranco_Lombardo.pdf

Access only from BNCF and BNCR

Type: Other attached material
Size 3.42 MB
Format Adobe PDF
Relazione finale Dottorato in Tecnologie dell’Informazione (2017-2020).pdf

Access only from BNCF and BNCR

Type: Other attached material
Size 5.5 kB
Format Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/154957
The NBN code of this thesis is URN:NBN:IT:UNIPR-154957