Due to the exponential growth of genomic data, constructing dedicated data structures has become the principal bottleneck in common bioinformatics applications. A large number of algorithms for constructing and processing data structures on texts exist in the literature, but they are faced with the challenge of keeping up with the volume of new data. Especially in pangenomics, where numerous genomes of the same species are produced and processed together, one can hope to modify existing algorithms or introduce new ones to exploit the repetitiveness that is contained within this kind of datasets. In the first part of this thesis, we will focus on speeding up the computation of a key data structure used in bioinformatics: the matching statistics. We propose various heuristics aimed at improving the running time for the computation of the matching statistics for large sets of similar strings. Then we introduce a compressed representation of the matching statistics. The new representation is shown to be very small in practice while retaining all the information needed for allowing fast random access. This permitted us to use this new representation for constructing other fundamental data structures, namely the suffix array and the Burrows-Wheeler Transform (BWT). Next, we study the topic of merging BWTs, where data belonging to distinct sets is dissimilar. Here, the matching statistics are used to speed up the merging phase, where the algorithm exploits areas of high difference between the two sets. In the second part of the thesis, we show how to handle dynamic strings and dynamic permutations. The dynamic setting is of high interest due to the fast pace at which data is produced nowadays. We present new data structures that are simple and elegant and handle both strings and permutations using a common underlying idea: a forest of splay trees. Splay trees are binary search trees which are particularly well suited for a dynamic setting. Moreover, they are easy to understand and to implement, and have therefore high practical applicability. In the last chapter of this part, we present a simple data structure whose application has significantly improved a recent algorithm for combinatorial generation of permutations. Finally, in the last part, we study the problem of efficient access to Lempel-Ziv-compressed text. The Lempel-Ziv compression (LZ77) is one of the most well-known and most used methods for compressing large amounts of repetitive data. An open problem is how to access arbitrary characters without the need to decompress the whole input. In this thesis, we propose new heuristics to build different kinds of LZ-like parsings that allow constant time access to characters while in compressed form, without compromising the compression ratio by more than a small percentage w.r.t.\ the original LZ parse. All of these new algorithms have been implemented and tested against the main competitors in the respective fields, and they have shown good results. We close with an outlook to future research.

On Data Structures for Texts and Permutations

MASILLO, FRANCESCO
2025

Abstract

Due to the exponential growth of genomic data, constructing dedicated data structures has become the principal bottleneck in common bioinformatics applications. A large number of algorithms for constructing and processing data structures on texts exist in the literature, but they are faced with the challenge of keeping up with the volume of new data. Especially in pangenomics, where numerous genomes of the same species are produced and processed together, one can hope to modify existing algorithms or introduce new ones to exploit the repetitiveness that is contained within this kind of datasets. In the first part of this thesis, we will focus on speeding up the computation of a key data structure used in bioinformatics: the matching statistics. We propose various heuristics aimed at improving the running time for the computation of the matching statistics for large sets of similar strings. Then we introduce a compressed representation of the matching statistics. The new representation is shown to be very small in practice while retaining all the information needed for allowing fast random access. This permitted us to use this new representation for constructing other fundamental data structures, namely the suffix array and the Burrows-Wheeler Transform (BWT). Next, we study the topic of merging BWTs, where data belonging to distinct sets is dissimilar. Here, the matching statistics are used to speed up the merging phase, where the algorithm exploits areas of high difference between the two sets. In the second part of the thesis, we show how to handle dynamic strings and dynamic permutations. The dynamic setting is of high interest due to the fast pace at which data is produced nowadays. We present new data structures that are simple and elegant and handle both strings and permutations using a common underlying idea: a forest of splay trees. Splay trees are binary search trees which are particularly well suited for a dynamic setting. Moreover, they are easy to understand and to implement, and have therefore high practical applicability. In the last chapter of this part, we present a simple data structure whose application has significantly improved a recent algorithm for combinatorial generation of permutations. Finally, in the last part, we study the problem of efficient access to Lempel-Ziv-compressed text. The Lempel-Ziv compression (LZ77) is one of the most well-known and most used methods for compressing large amounts of repetitive data. An open problem is how to access arbitrary characters without the need to decompress the whole input. In this thesis, we propose new heuristics to build different kinds of LZ-like parsings that allow constant time access to characters while in compressed form, without compromising the compression ratio by more than a small percentage w.r.t.\ the original LZ parse. All of these new algorithms have been implemented and tested against the main competitors in the respective fields, and they have shown good results. We close with an outlook to future research.
2025
Inglese
166
File in questo prodotto:
File Dimensione Formato  
mainThesisFM.pdf

accesso aperto

Dimensione 3.58 MB
Formato Adobe PDF
3.58 MB Adobe PDF Visualizza/Apri

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14242/212223
Il codice NBN di questa tesi è URN:NBN:IT:UNIVR-212223