The main focus of this thesis is the data summary problem, where one takes a large set of multidimensional data points in the Euclidean space, creates a data summary that is much smaller and then uses only the summary to approximately answer range count queries over the data. Range count queries are a crucial tool for data analysis and selectivity estimation in query optimizers. While traditionally the functionality of data summaries is limited to estimates and confidence intervals for the range count, in this thesis the functionality is extended to lower and upper bounds. For this purpose, DigitHist andà,  SliceHist are proposed as two novel histogram-based data summaries and the u-error as a novel metric to assess the histogram precision. A key feature of DigitHist is a fast one-scan construction, whereas SliceHist offers strong theoretical guarantees. The DigitHist summary is comprised of multiple equi-width histograms and summarizes dense areas separately to avoid large buckets with many points. In addition to that, DigitHist incorporates one-dimensional summaries to prevent oversimplifying assumptions about one-dimensional distributions that can lead to large errors. The experiments show that DigitHist offers uniquely tight bounds in a moderate number of dimensions. The SliceHist summary computes approximate ranks of points, summarizes them along regular grids and then summarizes the points in each grid slice again in the same way. The recursive summarization of points produces a compact epsilon-approximation, i.e., a data summary that guarantees for any dataset and query range to approximate the number of points inside the range with absolute error less than eps n, where n is the data size. SliceHist proves the existence of an epsilon-approximation data summary of size O( 1/(e^L) log(1/e) ){arepsilon} ) for any arbitrary real L > 1, which was previously not even conjectured. The experiments show that SliceHist offers comparable estimation accuracy to summaries without guarantees in up to four dimensions, but additionally offers theoretical guarantees and logarithmic query times. Compared to existing epsilon-approximations it can be constructed significantly faster and offers tighter guarantees and higher estimation accuracy in three and four dimensions. This thesis also tackles two other problems related to data summaries. The first problem is to quantify the similarity of datasets with regards to range count queries. For this purpose, the discrepancy distance is proposed that bounds for a pair of datasets their maximal deviation over all range queries and can be efficiently approximated using SliceHist. The second problem is to compute range sums over low-dimensional data cubes such as grid histograms. Sparse Prefix Sums are proposed to achieve constant-time querying with significantly lowered storage costs for sparsely populated data cubes.

Reliable Multidimensional Data Summaries Using Histograms

-
2018

Abstract

The main focus of this thesis is the data summary problem, where one takes a large set of multidimensional data points in the Euclidean space, creates a data summary that is much smaller and then uses only the summary to approximately answer range count queries over the data. Range count queries are a crucial tool for data analysis and selectivity estimation in query optimizers. While traditionally the functionality of data summaries is limited to estimates and confidence intervals for the range count, in this thesis the functionality is extended to lower and upper bounds. For this purpose, DigitHist andà,  SliceHist are proposed as two novel histogram-based data summaries and the u-error as a novel metric to assess the histogram precision. A key feature of DigitHist is a fast one-scan construction, whereas SliceHist offers strong theoretical guarantees. The DigitHist summary is comprised of multiple equi-width histograms and summarizes dense areas separately to avoid large buckets with many points. In addition to that, DigitHist incorporates one-dimensional summaries to prevent oversimplifying assumptions about one-dimensional distributions that can lead to large errors. The experiments show that DigitHist offers uniquely tight bounds in a moderate number of dimensions. The SliceHist summary computes approximate ranks of points, summarizes them along regular grids and then summarizes the points in each grid slice again in the same way. The recursive summarization of points produces a compact epsilon-approximation, i.e., a data summary that guarantees for any dataset and query range to approximate the number of points inside the range with absolute error less than eps n, where n is the data size. SliceHist proves the existence of an epsilon-approximation data summary of size O( 1/(e^L) log(1/e) ){arepsilon} ) for any arbitrary real L > 1, which was previously not even conjectured. The experiments show that SliceHist offers comparable estimation accuracy to summaries without guarantees in up to four dimensions, but additionally offers theoretical guarantees and logarithmic query times. Compared to existing epsilon-approximations it can be constructed significantly faster and offers tighter guarantees and higher estimation accuracy in three and four dimensions. This thesis also tackles two other problems related to data summaries. The first problem is to quantify the similarity of datasets with regards to range count queries. For this purpose, the discrepancy distance is proposed that bounds for a pair of datasets their maximal deviation over all range queries and can be efficiently approximated using SliceHist. The second problem is to compute range sums over low-dimensional data cubes such as grid histograms. Sparse Prefix Sums are proposed to achieve constant-time querying with significantly lowered storage costs for sparsely populated data cubes.
2018
en
Approximate query processing
Data summary structures
Databases
Distance estimation
Index
Online analytical processing (OLAP)
Synopsis data structures
Libera Università di Bolzano
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14242/273460
Il codice NBN di questa tesi è URN:NBN:IT:UNIBZ-273460