Space-Efficient Data Structures for Collections of Textual Data

Ottaviano, Giuseppe

This thesis focuses on the design of succinct and compressed data structures for collections of string-based data, specifically sequences of semi-structured documents in textual format, sets of strings, and sequences of strings. The study of such collections is motivated by a large number of applications, both in theory and in practice. For textual semi-structured data, we introduce the concept of a semi-index, a succinct construction that speeds up access to documents encoded in textual semi-structured formats, such as JSON and XML, by separately storing a compact description of their parse trees, thus avoiding the need to re-parse the documents every time they are read. For string dictionaries, we describe a data structure based on a path decomposition of the compacted trie built on the string set. The tree topology is encoded using succinct data structures, while the node labels are compressed using a simple dictionary-based scheme. We also describe a variant of the path-decomposed trie for scored string sets, where each string has a score. This data structure efficiently supports top-k completion queries: given a string p and an integer k, return the k highest-scored strings among those prefixed by p. For sequences of strings, we introduce the problem of compressed indexed sequences of strings, that is, representing indexed sequences of strings in nearly optimal compressed space, in both the static and dynamic settings, while supporting random access, searching, and counting operations, for both exact matches and prefix search. We present a new data structure, the Wavelet Trie, that solves the problem by combining a Patricia trie with a wavelet tree. The Wavelet Trie improves on state-of-the-art compressed data structures for sequences by supporting a dynamic alphabet and prefix queries. Finally, we discuss the practical implementation of the succinct primitives used throughout the thesis for the experiments. These primitives are implemented as part of a publicly available library, Succinct, using state-of-the-art algorithms along with some improvements.
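
To make the top-k completion query described in the abstract concrete, the following is a naive reference sketch in Python of the query semantics only. It is not the path-decomposed trie from the thesis: it assumes a plain list of (string, score) pairs sorted by string, and the function name `top_k_completions` and the sample data are hypothetical.

```python
import bisect
import heapq


def top_k_completions(scored, p, k):
    """Return the k highest-scored (string, score) pairs whose string starts with p.

    `scored` is assumed to be a list of (string, score) pairs sorted by string.
    This is a baseline for the query semantics, not a succinct data structure.
    """
    keys = [s for s, _ in scored]
    # In sorted order, the strings prefixed by p form one contiguous range
    # that starts at the first key >= p.
    lo = bisect.bisect_left(keys, p)
    hi = lo
    while hi < len(scored) and scored[hi][0].startswith(p):
        hi += 1
    # Select the k best scores within that range.
    return heapq.nlargest(k, scored[lo:hi], key=lambda pair: pair[1])


# Hypothetical sample data.
scored = sorted([("car", 5), ("cart", 9), ("care", 7), ("cat", 8), ("dog", 10)])
print(top_k_completions(scored, "car", 2))  # [('cart', 9), ('care', 7)]
```

This baseline scans every string prefixed by p, so its cost grows with the number of matches; the abstract's point is that the scored path-decomposed trie answers the same query efficiently in compressed space.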

2013

Publication date: 30 May 2013
Language: Italian
Keywords: compression; semi-structured data; sequences; strings; succinct data structures
Supervisor, Advisor, or Tutor: Grossi, Roberto
Collection: Università degli Studi di Pisa

Files in this item:

File	Size	Format
thesis.pdf (open access; type: other attached material; license: all rights reserved)	1.03 MB	Adobe PDF

Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14242/134435

The NBN code of this thesis is URN:NBN:IT:UNIPI-134435