
Designing new compressed data structures using data-aware approaches

BOFFA, ANTONIO
2024

Abstract

Our society is generating an exponentially increasing amount of data, and that data is becoming progressively more repetitive. Compressed data structures have traditionally played a crucial role in addressing repetitiveness, but today's dynamic data flows are characterised by newly emerging patterns and trends. Ignoring this evolution means missing the opportunity to significantly improve both the space and the time efficiency of systems. In this thesis, we design, implement, and experimentally validate novel data-aware compressed data structures for a wide set of data types. Our schemes automatically adapt to new patterns and trends arising in Big Data by using new algorithms as well as state-of-the-art machine-learning-inspired techniques. First, the thesis introduces a learned approach to the ubiquitous problem of compressing and indexing integers. Second, it explores data-aware optimisation strategies for constructing compressed trie structures that index and compress strings. Third, it presents theoretically grounded solutions for selecting compression encodings for table columns in industrial analytical database management systems. Finally, it addresses the compression of huge source code datasets, taking into account file similarity based on the actual content. To demonstrate their practical benefit, these techniques have been thoroughly compared against well-engineered known solutions. The datasets range from tens of GB of integers and strings to petabyte-scale database columns and ultra-large-scale source code datasets. In conclusion, this PhD thesis contributes to the evolving landscape of data management and compression in the era of Big Data. The data-aware compressed data structures proposed and examined here contribute to the emerging trend of designing adaptive systems that automatically tailor themselves to diverse patterns.
4 May 2024
Italian
algorithms
compressed data structure
database
data compression
data structure
Ferragina, Paolo
Files in this item:
PhD_Thesis_AntonioBoffa_final.pdf (open access): 1.89 MB, Adobe PDF
Report_on_the_PhD_activities.pdf (not available): 96.93 kB, Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/216600
The NBN code of this thesis is URN:NBN:IT:UNIPI-216600