This thesis revisits two fundamental problems in data structure design: predecessor search and rank/select primitives. These problems are pervasive in applications, particularly in areas such as database systems, search engines, bioinformatics, and Internet routing. We show that real data present a peculiar kind of regularity that can be explained in terms of geometric considerations. We name it "approximate linearity" and analyse its algorithmic effectiveness in a variety of possible input data distributions. We then expand the horizon of compressed data structures by presenting solutions for the problems above that discover, or "learn", in a rigorous and efficient algorithmic way, the approximate linearities present in the data. In addition, we show how to combine this new form of compressibility with the classic repetition-aware approaches thus introducing a new class of compressed indexes. We accompany our theoretical results with implementations and experiments on large amounts of data, and we show that, compared to several well-engineered known compressed indexes, our data structures provide improvements in time, in space or both (often of orders of magnitude).

Learning-based compressed data structures

VINCIGUERRA, GIORGIO
2022

Abstract

This thesis revisits two fundamental problems in data structure design: predecessor search and rank/select primitives. These problems are pervasive in applications, particularly in areas such as database systems, search engines, bioinformatics, and Internet routing. We show that real data present a peculiar kind of regularity that can be explained in terms of geometric considerations. We name it "approximate linearity" and analyse its algorithmic effectiveness in a variety of possible input data distributions. We then expand the horizon of compressed data structures by presenting solutions for the problems above that discover, or "learn", in a rigorous and efficient algorithmic way, the approximate linearities present in the data. In addition, we show how to combine this new form of compressibility with the classic repetition-aware approaches thus introducing a new class of compressed indexes. We accompany our theoretical results with implementations and experiments on large amounts of data, and we show that, compared to several well-engineered known compressed indexes, our data structures provide improvements in time, in space or both (often of orders of magnitude).
18-feb-2022
Italiano
algorithm engineering
algorithms
compressed data structures
data compression
Ferragina, Paolo
File in questo prodotto:
File Dimensione Formato  
report.pdf

accesso aperto

Dimensione 141.01 kB
Formato Adobe PDF
141.01 kB Adobe PDF Visualizza/Apri
thesis.pdf

accesso aperto

Dimensione 2.25 MB
Formato Adobe PDF
2.25 MB Adobe PDF Visualizza/Apri

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14242/216283
Il codice NBN di questa tesi è URN:NBN:IT:UNIPI-216283