In the recent years, Machine Learning techniques have emerged as a new way to obtain solutions for a given problem, the novelty of the Machine Learning approach lies in the ability to automatically learn solutions by only looking at the observations of phenomena or examples of the expected behaviour. Besides, Machine Learning techniques may be adopted when the problem exceeds the human ability to find out a solution and are, at the present time, a viable solution also in fields that were previously dominated by the human intelligence. In this thesis we will describe the work developed at the Machine Learning Lab at University of Trieste, consisting in novel Machine Learning techniques aimed at the solution of real world problems of practical interest: automatic synthesis of regular expressions for text extraction and text classification tasks; an approach for the continuous reauthentication of web users; design of algorithms for author verification for text documents; author profiling for text messages; automatic generation of fake textual reviews. Among them the main contribution of this thesis is the design and implementation of new algorithms for the automatic generation of regular expressions for text extraction, based solely on examples of the desired behavior. This is a long-standing problem where we aim at generating a regular expression that generalizes the extraction behavior represented by some examples, i.e., strings annotated by a user with the desired portions to be extracted. The proposed algorithms are based on an evolutionary approach called Genetic Programming (GP), that is an evolutionary computing paradigm which implements an heuristic search, in a space of candidate solution, in a way that mimic the natural evolution process. The results demonstrate that our new algorithms have higher effectiveness than previous proposals and demonstrate that our algorithms are able to generate regular expressions in a way that is competitive with human experts both in terms of effectiveness and generation time. Thanks to these achievements, the proposed method has been awarded with the Silver Medal at the 13th Annual "Humies" Award, an international competition that establishes the state of the art in genetic and evolutionary computation. Later in this thesis we will extend our GP algorithms in order to operate in an Active learning scenario. In this scenario the user is not required to annotate all the examples at once but the Active learning tool interacts with the user in order to assist him during the annotation of the extractions in examples. Moreover, in this thesis we will consider two applications of the proposed regular expressions generator, adapted in order to cope with text categorization problems that are different from text extraction: (i) the Regex Golf game and (ii) the identification of Genic Interactions in sentences. The following contributions leave the field of the Genetic Programming algorithms and will propose solutions based on other Machine Learning methodologies, ranging from Grammatical Evolution to Support Vector Machines and Random Forests to Recurrent Neural Networks. In this contributions, we will propose: a methodology for predicting the accuracy of text extractors; a novel learning algorithm that is able to generate a string similarity function tailored to problems of syntax-based entity extraction from unstructured text streams; a system for continuous reauthentication of web users based on the observed mouse dynamics; a method for the authentication of text document; an algorithm for the user profiling analyzing a set of his/her Twitter tweets; text generators capable of generating, automatically, fake reviews for a given scientific paper and fake consumer reviews for a restaurant.
Genetic Programming Techniques for Regular Expression inference from Examples
TARLAO, FABIANO
2017
Abstract
In the recent years, Machine Learning techniques have emerged as a new way to obtain solutions for a given problem, the novelty of the Machine Learning approach lies in the ability to automatically learn solutions by only looking at the observations of phenomena or examples of the expected behaviour. Besides, Machine Learning techniques may be adopted when the problem exceeds the human ability to find out a solution and are, at the present time, a viable solution also in fields that were previously dominated by the human intelligence. In this thesis we will describe the work developed at the Machine Learning Lab at University of Trieste, consisting in novel Machine Learning techniques aimed at the solution of real world problems of practical interest: automatic synthesis of regular expressions for text extraction and text classification tasks; an approach for the continuous reauthentication of web users; design of algorithms for author verification for text documents; author profiling for text messages; automatic generation of fake textual reviews. Among them the main contribution of this thesis is the design and implementation of new algorithms for the automatic generation of regular expressions for text extraction, based solely on examples of the desired behavior. This is a long-standing problem where we aim at generating a regular expression that generalizes the extraction behavior represented by some examples, i.e., strings annotated by a user with the desired portions to be extracted. The proposed algorithms are based on an evolutionary approach called Genetic Programming (GP), that is an evolutionary computing paradigm which implements an heuristic search, in a space of candidate solution, in a way that mimic the natural evolution process. The results demonstrate that our new algorithms have higher effectiveness than previous proposals and demonstrate that our algorithms are able to generate regular expressions in a way that is competitive with human experts both in terms of effectiveness and generation time. Thanks to these achievements, the proposed method has been awarded with the Silver Medal at the 13th Annual "Humies" Award, an international competition that establishes the state of the art in genetic and evolutionary computation. Later in this thesis we will extend our GP algorithms in order to operate in an Active learning scenario. In this scenario the user is not required to annotate all the examples at once but the Active learning tool interacts with the user in order to assist him during the annotation of the extractions in examples. Moreover, in this thesis we will consider two applications of the proposed regular expressions generator, adapted in order to cope with text categorization problems that are different from text extraction: (i) the Regex Golf game and (ii) the identification of Genic Interactions in sentences. The following contributions leave the field of the Genetic Programming algorithms and will propose solutions based on other Machine Learning methodologies, ranging from Grammatical Evolution to Support Vector Machines and Random Forests to Recurrent Neural Networks. In this contributions, we will propose: a methodology for predicting the accuracy of text extractors; a novel learning algorithm that is able to generate a string similarity function tailored to problems of syntax-based entity extraction from unstructured text streams; a system for continuous reauthentication of web users based on the observed mouse dynamics; a method for the authentication of text document; an algorithm for the user profiling analyzing a set of his/her Twitter tweets; text generators capable of generating, automatically, fake reviews for a given scientific paper and fake consumer reviews for a restaurant.File | Dimensione | Formato | |
---|---|---|---|
Thesis PHD Fabiano Tarlao.pdf
accesso aperto
Dimensione
1.94 MB
Formato
Adobe PDF
|
1.94 MB | Adobe PDF | Visualizza/Apri |
I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/20.500.14242/176999
URN:NBN:IT:UNITS-176999