This thesis illustrates some advancements in finite-state methods for solving the string-matching problem and some of its variants. Finite-state automata are central building blocks in string matching algorithms. In particular, the ones relevant for the present study are the Knuth-Morris-Pratt and the suffix automaton as well as their generalizations for the multiple-string-matching problem, i.e., the Aho-Corasick automaton and the automaton induced from the DAWG (Directed Acyclic Word Graph) for a set of strings. In this work I present novel encodings, based on the bit-parallelism technique, of the nondeterministic versions of the aforementioned automata and also illustrate two methods to parallelize a recent algorithm based on the nondeterministic suffix automaton. I further discuss the approximate-string-matching problem. I also show a new simple result concerning the pattern-matching-with-swaps problem and present a new distance function that is defined in terms of edit operations which involve strings rather than single characters. I also present an algorithm, based on dynamic programming and on the DAWG, to solve the approximate-string-matching problem under this distance. I finally discuss the compressed-string-matching problem and present novel results that concern searching in texts encoded with Huffman codes and with the Burrows-Wheeler transform.
Advancements in finite-state methods for string matching
2011
Abstract
This thesis illustrates some advancements in finite-state methods for solving the string-matching problem and some of its variants. Finite-state automata are central building blocks in string matching algorithms. In particular, the ones relevant for the present study are the Knuth-Morris-Pratt and the suffix automaton as well as their generalizations for the multiple-string-matching problem, i.e., the Aho-Corasick automaton and the automaton induced from the DAWG (Directed Acyclic Word Graph) for a set of strings. In this work I present novel encodings, based on the bit-parallelism technique, of the nondeterministic versions of the aforementioned automata and also illustrate two methods to parallelize a recent algorithm based on the nondeterministic suffix automaton. I further discuss the approximate-string-matching problem. I also show a new simple result concerning the pattern-matching-with-swaps problem and present a new distance function that is defined in terms of edit operations which involve strings rather than single characters. I also present an algorithm, based on dynamic programming and on the DAWG, to solve the approximate-string-matching problem under this distance. I finally discuss the compressed-string-matching problem and present novel results that concern searching in texts encoded with Huffman codes and with the Burrows-Wheeler transform.I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/20.500.14242/224855
URN:NBN:IT:UNICT-224855