The task of parsing consists of splitting an input text into a sequence of contiguous phrases. Several classes of data compression algorithms rely on parsing techniques either to identify repetitions inside the input string (dictionary-based compression) or to locate homogeneous pieces of data which are separately compressed (Permute- Partition-Compress paradigm). In these applications, the choice of the parsing strategy is determinant for the final performance of the compressor. An ideal parsing algorithm should be able to parse the input in a way that minimizes the output-size of the underlying compressor, and the question is how efficiently this can be done. Many investigations have ocused on parsing algorithms that achieve optimality in the compressor’s output-size but the solutions proposed in literature are far from being satisfactory. In fact, most of them are either simple approaches based on dynamic programming with prohibitive time complexities, or heuristic algorithms which do not offer any bounds on the efficacy of the solution. We propose a new approach to the design of optimal parsing algorithms, achieving significant improvements in running time over previous methods. It is well known that this problem can be modeled as a shortest-path computation over a particular directed-acyclic graph. We build upon this idea by showing that the class of graphs arising from this reduction satisfies particular structural properties that can be exploited by our algorithms to speed-up a lot shortest-path computation. We obtain new results by applying this approach to the contexts of dictionary-based compression and Permute-Partition-Compress paradigm. We consider the class of LZ77-based compressors, the most powerful example of dictionary-based compression, and design the first parser which achieve optimality in the compressed output size (measured in bits) by taking efficient/optimal time and optimal space. Then, using similar techniques, we provide an approximate parsing algorithm that, when used inside the Permute-Partition-Compress paradigm, produces a compressed output whose size is guaranteed to be no more than (1 + ε)-worse the optimal one, where ε is a user defined constant.

Parsing Algorithms for Data Compression

2010

Abstract

The task of parsing consists of splitting an input text into a sequence of contiguous phrases. Several classes of data compression algorithms rely on parsing techniques either to identify repetitions inside the input string (dictionary-based compression) or to locate homogeneous pieces of data which are separately compressed (Permute- Partition-Compress paradigm). In these applications, the choice of the parsing strategy is determinant for the final performance of the compressor. An ideal parsing algorithm should be able to parse the input in a way that minimizes the output-size of the underlying compressor, and the question is how efficiently this can be done. Many investigations have ocused on parsing algorithms that achieve optimality in the compressor’s output-size but the solutions proposed in literature are far from being satisfactory. In fact, most of them are either simple approaches based on dynamic programming with prohibitive time complexities, or heuristic algorithms which do not offer any bounds on the efficacy of the solution. We propose a new approach to the design of optimal parsing algorithms, achieving significant improvements in running time over previous methods. It is well known that this problem can be modeled as a shortest-path computation over a particular directed-acyclic graph. We build upon this idea by showing that the class of graphs arising from this reduction satisfies particular structural properties that can be exploited by our algorithms to speed-up a lot shortest-path computation. We obtain new results by applying this approach to the contexts of dictionary-based compression and Permute-Partition-Compress paradigm. We consider the class of LZ77-based compressors, the most powerful example of dictionary-based compression, and design the first parser which achieve optimality in the compressed output size (measured in bits) by taking efficient/optimal time and optimal space. Then, using similar techniques, we provide an approximate parsing algorithm that, when used inside the Permute-Partition-Compress paradigm, produces a compressed output whose size is guaranteed to be no more than (1 + ε)-worse the optimal one, where ε is a user defined constant.
1-giu-2010
Italiano
Ferragina, Paolo
Università degli Studi di Pisa
File in questo prodotto:
File Dimensione Formato  
tesiIgorNitto.pdf

embargo fino al 24/06/2050

Tipologia: Altro materiale allegato
Dimensione 399.22 kB
Formato Adobe PDF
399.22 kB Adobe PDF

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14242/135623
Il codice NBN di questa tesi è URN:NBN:IT:UNIPI-135623