
Compression and Efficient Query Processing for Large-Scale Data Sequences

GUERRA, ANDREA
2025

Abstract

Time series and integer data sequences play a critical role in many fields, including finance, healthcare, industry, and environmental monitoring. However, efficiently storing and retrieving these sequences remains challenging due to their continuous and unbounded growth. General-purpose compressors such as Xz and Zstd offer strong compression ratios but lack support for efficient random access on compressed data, limiting their use in real-time analytics. In contrast, ad hoc streaming solutions are typically optimized for compression and decompression speed at the expense of compression effectiveness and random access functionality. Furthermore, all these methods are unaware of the special regularities and serial correlations of numerical data sequences, whose trends over time can often be described by linear and nonlinear functions. This thesis introduces novel techniques for compressing and efficiently querying large-scale integer and time series data, focusing on both lossy and lossless compression schemes with random access support. We first present an optimal algorithm for error-bounded piecewise nonlinear approximation that generalizes the classical piecewise linear segmentation problem to a broad family of functions of different kinds and shapes. Building upon this foundation, we propose NeaTSL, a learned lossy compressor that minimizes space under a strict error bound, and NeaTS, a lossless compressor that compactly encodes the residuals of nonlinear approximations to support efficient random access, fast decompression, and range queries. Our empirical evaluation spans 16 real-world datasets from diverse application domains. The results show that NeaTSL outperforms existing error-bounded compressors in space efficiency by up to 14%, while NeaTS achieves state-of-the-art performance in compression ratio, decompression speed, and random access latency, delivering improvements of up to three orders of magnitude.
Finally, we apply piecewise nonlinear approximation techniques to the learned indexing problem, showing that nonlinear models can significantly reduce index size while preserving query guarantees. These findings highlight the potential of nonlinear approximations as a powerful tool for building compact and query-efficient data representations in modern analytical systems.
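To make the abstract's central idea concrete, the following is a minimal illustrative sketch of error-bounded piecewise approximation, not the thesis's optimal algorithm: a simple greedy heuristic that partitions an integer sequence into segments, each fitted by a least-squares line whose pointwise error stays within a bound `eps`. Under such a guarantee, a lossless coder only needs to store the small residuals between the data and the model. The function name `greedy_segments` and the whole structure are assumptions for illustration only.

```python
# Illustrative sketch (NOT the thesis's algorithm): greedy error-bounded
# piecewise-linear approximation. Each segment (start, slope, intercept)
# guarantees |values[i] - (slope*i + intercept)| <= eps over its range,
# so a lossless scheme need only encode the bounded residuals.

def greedy_segments(values, eps):
    """Partition `values` into maximal runs that a least-squares line
    approximates within pointwise error `eps`."""
    segments = []
    start = 0
    while start < len(values):
        end = start + 1
        best = (0.0, float(values[start]))  # single point: flat line
        while end < len(values):
            xs = range(start, end + 1)
            n = end - start + 1
            mx = sum(xs) / n
            my = sum(values[start:end + 1]) / n
            den = sum((x - mx) ** 2 for x in xs)
            slope = (sum((x - mx) * (values[x] - my) for x in xs) / den
                     if den else 0.0)
            inter = my - slope * mx
            # Stop extending as soon as the error bound is violated.
            if max(abs(values[x] - (slope * x + inter)) for x in xs) > eps:
                break
            best = (slope, inter)
            end += 1
        segments.append((start, *best))
        start = end
    return segments
```

For example, `greedy_segments([1, 2, 3, 4, 10, 10, 10], 0.5)` yields two segments: a line of slope 1 covering the first four values and a flat line covering the trailing plateau. The thesis's contribution differs in two key ways: its fragmentation is provably optimal (minimum number of pieces) rather than greedy, and it fits a broad family of nonlinear models, not just lines.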
25 November 2025
Italian
Integer compression
Learned data structures
Algorithm Engineering
Ferragina, Paolo
Vinciguerra, Giorgio
Files in this record:

File: PhDThesis.pdf
Access: open access
License: Creative Commons
Size: 2.5 MB
Format: Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/313065
The NBN code of this thesis is URN:NBN:IT:UNIPI-313065