On Data Structures for Texts and Permutations

Masillo, Francesco

Due to the exponential growth of genomic data, constructing dedicated data structures has become the principal bottleneck in common bioinformatics applications. A large number of algorithms for constructing and processing data structures on texts exist in the literature, but they are faced with the challenge of keeping up with the volume of new data. Especially in pangenomics, where numerous genomes of the same species are produced and processed together, one can hope to modify existing algorithms or introduce new ones to exploit the repetitiveness that is contained within this kind of datasets. In the first part of this thesis, we will focus on speeding up the computation of a key data structure used in bioinformatics: the matching statistics. We propose various heuristics aimed at improving the running time for the computation of the matching statistics for large sets of similar strings. Then we introduce a compressed representation of the matching statistics. The new representation is shown to be very small in practice while retaining all the information needed for allowing fast random access. This permitted us to use this new representation for constructing other fundamental data structures, namely the suffix array and the Burrows-Wheeler Transform (BWT). Next, we study the topic of merging BWTs, where data belonging to distinct sets is dissimilar. Here, the matching statistics are used to speed up the merging phase, where the algorithm exploits areas of high difference between the two sets. In the second part of the thesis, we show how to handle dynamic strings and dynamic permutations. The dynamic setting is of high interest due to the fast pace at which data is produced nowadays. We present new data structures that are simple and elegant and handle both strings and permutations using a common underlying idea: a forest of splay trees. Splay trees are binary search trees which are particularly well suited for a dynamic setting. Moreover, they are easy to understand and to implement, and have therefore high practical applicability. In the last chapter of this part, we present a simple data structure whose application has significantly improved a recent algorithm for combinatorial generation of permutations. Finally, in the last part, we study the problem of efficient access to Lempel-Ziv-compressed text. The Lempel-Ziv compression (LZ77) is one of the most well-known and most used methods for compressing large amounts of repetitive data. An open problem is how to access arbitrary characters without the need to decompress the whole input. In this thesis, we propose new heuristics to build different kinds of LZ-like parsings that allow constant time access to characters while in compressed form, without compromising the compression ratio by more than a small percentage w.r.t.\ the original LZ parse. All of these new algorithms have been implemented and tested against the main competitors in the respective fields, and they have shown good results. We close with an outlook to future research.

On Data Structures for Texts and Permutations

MASILLO, FRANCESCO

2025

Abstract

Due to the exponential growth of genomic data, constructing dedicated data structures has become the principal bottleneck in common bioinformatics applications. A large number of algorithms for constructing and processing data structures on texts exist in the literature, but they are faced with the challenge of keeping up with the volume of new data. Especially in pangenomics, where numerous genomes of the same species are produced and processed together, one can hope to modify existing algorithms or introduce new ones to exploit the repetitiveness that is contained within this kind of datasets. In the first part of this thesis, we will focus on speeding up the computation of a key data structure used in bioinformatics: the matching statistics. We propose various heuristics aimed at improving the running time for the computation of the matching statistics for large sets of similar strings. Then we introduce a compressed representation of the matching statistics. The new representation is shown to be very small in practice while retaining all the information needed for allowing fast random access. This permitted us to use this new representation for constructing other fundamental data structures, namely the suffix array and the Burrows-Wheeler Transform (BWT). Next, we study the topic of merging BWTs, where data belonging to distinct sets is dissimilar. Here, the matching statistics are used to speed up the merging phase, where the algorithm exploits areas of high difference between the two sets. In the second part of the thesis, we show how to handle dynamic strings and dynamic permutations. The dynamic setting is of high interest due to the fast pace at which data is produced nowadays. We present new data structures that are simple and elegant and handle both strings and permutations using a common underlying idea: a forest of splay trees. Splay trees are binary search trees which are particularly well suited for a dynamic setting. Moreover, they are easy to understand and to implement, and have therefore high practical applicability. In the last chapter of this part, we present a simple data structure whose application has significantly improved a recent algorithm for combinatorial generation of permutations. Finally, in the last part, we study the problem of efficient access to Lempel-Ziv-compressed text. The Lempel-Ziv compression (LZ77) is one of the most well-known and most used methods for compressing large amounts of repetitive data. An open problem is how to access arbitrary characters without the need to decompress the whole input. In this thesis, we propose new heuristics to build different kinds of LZ-like parsings that allow constant time access to characters while in compressed form, without compromising the compression ratio by more than a small percentage w.r.t.\ the original LZ parse. All of these new algorithms have been implemented and tested against the main competitors in the respective fields, and they have shown good results. We close with an outlook to future research.

Scheda breve

Scheda completa

Scheda completa (DC)

	Corso di studio
	
				Informatica
			
	Data di pubblicazione
	
				2025
			
	Lingua
	
				Inglese
			
	Relatore, Supervisor, Advisor o Tutor
	
				Lipták, Zsuzsanna
			
	Numero di pagine
	
				166
			
	Collezione di appartenenza
	
				Università degli Studi di Verona

File in questo prodotto:

File	Dimensione	Formato
mainThesisFM.pdf accesso aperto Licenza: Tutti i diritti riservati Dimensione 3.58 MB Formato Adobe PDF Visualizza/Apri	3.58 MB	Adobe PDF	Visualizza/Apri

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14242/212223

Il codice NBN di questa tesi è URN:NBN:IT:UNIVR-212223