Balancing and clustering of words: a combinatorial analysis of the Burrows & Wheeler Transform

Rosone, Giovanna

The Burrows-Wheeler Transform (denoted by BWT) is a well founded mathematical transformation on sequences introduced in 1994, widely used in the context of Data Compression and recently studied also from a combinatorial point of view. The transformation does not itself compress the data, but it produces a permutation bwt(w) of an input string w that is easier to compress than the original one, with some fast locally-adaptive algorithms, such as Move-to-Front in combination with Huffman or arithmetic coding. It is well-known that in most real texts, characters with the same or similar contexts tend to be the same. So, the BWT tends to group together characters which occur adjacent to similar text substring and this property is good for compression. The aim of this thesis is to study such “clustering effect” by using notions and methods from Combinatorics on Words. The notion of balance of a word plays a central role in our investigation. Empirical observations suggest that balance is actually the combinatorial property of input word that ensure optimal BWT compression. Moreover, it is reasonable to assume that the more balanced the input word is, the more local similarity one has after BWT (and therefore the better the compression is). This hypothesis is here corroborate by experiments on “real” text, by using local entropy as a measure of the degree of balance of a word. In the setting of Combinatorics on Words, a sound confirmation of previous hypothesis is given by a result in [Mantaci, Restivo, and Sciortino: Burrows Wheeler transform and Sturmian words. Inf. Process. Lett. 86(5): 241-246 (2003)], which states that, in the case of a binary alphabet, there is an equivalence between circularly balanced words, words having a clusterized BWT, and the conjugates of standard words. In the case of alphabet of size greater than two, there is no more equivalence. So we investigate the relationships between these notions, and other related ones (as, for instance, palindromic richness) in the case of a general alphabet.

Balancing and clustering of words: a combinatorial analysis of the Burrows & Wheeler Transform

ROSONE, Giovanna

2010

Abstract

The Burrows-Wheeler Transform (denoted by BWT) is a well founded mathematical transformation on sequences introduced in 1994, widely used in the context of Data Compression and recently studied also from a combinatorial point of view. The transformation does not itself compress the data, but it produces a permutation bwt(w) of an input string w that is easier to compress than the original one, with some fast locally-adaptive algorithms, such as Move-to-Front in combination with Huffman or arithmetic coding. It is well-known that in most real texts, characters with the same or similar contexts tend to be the same. So, the BWT tends to group together characters which occur adjacent to similar text substring and this property is good for compression. The aim of this thesis is to study such “clustering effect” by using notions and methods from Combinatorics on Words. The notion of balance of a word plays a central role in our investigation. Empirical observations suggest that balance is actually the combinatorial property of input word that ensure optimal BWT compression. Moreover, it is reasonable to assume that the more balanced the input word is, the more local similarity one has after BWT (and therefore the better the compression is). This hypothesis is here corroborate by experiments on “real” text, by using local entropy as a measure of the degree of balance of a word. In the setting of Combinatorics on Words, a sound confirmation of previous hypothesis is given by a result in [Mantaci, Restivo, and Sciortino: Burrows Wheeler transform and Sturmian words. Inf. Process. Lett. 86(5): 241-246 (2003)], which states that, in the case of a binary alphabet, there is an equivalence between circularly balanced words, words having a clusterized BWT, and the conjugates of standard words. In the case of alphabet of size greater than two, there is no more equivalence. So we investigate the relationships between these notions, and other related ones (as, for instance, palindromic richness) in the case of a general alphabet.

Scheda breve

Scheda completa

Scheda completa (DC)

	Data di pubblicazione
	
				17-mar-2010
			
	Lingua
	
				Inglese
			
	Relatore, Supervisor, Advisor o Tutor
	
				Restivo, Antonio
			
	Nome Editore
	
				Università degli Studi di Palermo
			
	Collezione di appartenenza
	
				Università degli Studi di Palermo

File in questo prodotto:

File	Dimensione	Formato
PhdThesis_RosoneGiovanna.pdf accesso solo da BNCF e BNCR Licenza: Tutti i diritti riservati Dimensione 629.65 kB Formato Adobe PDF	629.65 kB	Adobe PDF

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14242/104895

Il codice NBN di questa tesi è URN:NBN:IT:UNIPA-104895