Benchmarking and Optimization of OBDA Systems
2018
Abstract
In this thesis we address two distinct but intertwined problems: the benchmarking and the optimization of ontology-based data access (OBDA) systems. OBDA is an approach to the emerging need to provide an understandable view of the data stored in legacy systems. It separates the user from the data sources by means of an ontology, a formal specification of domain knowledge that exposes a conceptual view of the data. By accessing the data through this conceptual view, the user can query it with a more convenient vocabulary, does not need to be aware of storage details, and can obtain richer answers thanks to the combination of the data with the domain knowledge. Although prototype OBDA systems are available, their questionable performance remains a significant obstacle to wider adoption. There is now a need to shift the focus from theoretical studies, which have been very fruitful as witnessed by the abundant literature on the subject, to the study of practical solutions. This need motivates the work in this thesis.
As a first step, we have focused on how to assess the performance of an OBDA system through benchmarking. In doing so, we have identified a series of guidelines on how such systems should be evaluated. These guidelines are based on real-world application scenarios, such as industrial applications of OBDA. Following these guidelines, we have devised a novel benchmark based on real data from the oil industry. The benchmark comes with a data generator able to produce, from an initial data seed, datasets of increasing size while taking into account the requirements dictated by the OBDA setting. The data generator is not specific to our benchmark: it can be reused, without manual input from the end user, in any setting in which an ontology and an initial data instance are available. This is intended to ease and encourage the development of future OBDA benchmarks.
We have then turned to the problem of optimizing OBDA systems, so as to make them usable in practical scenarios. In this respect, we have studied two different solutions. The first is based on the observation that certain storage details and policies, which prior to this work were invisible to the OBDA paradigm, can be encoded into constraints that improve the performance of query answering by up to orders of magnitude in complex real-world industrial scenarios and in the presence of large enterprise databases. We provide a formalization of such constraints, explain how they can (or cannot) be used to improve the performance of an OBDA system, and single out the reasons why the performance improvements occur. The second solution is inspired by query processing in traditional relational database management systems (RDBMSs). In particular, we have studied the possibility of enriching an OBDA system with a planner able to choose the best execution plan for the query at hand. The choice is made according to a cost model that estimates the resource consumption of each alternative. We have devised a cost model specific to the OBDA scenario that combines statistics traditionally used in RDBMSs with OBDA-driven measures. Our experiments show that execution plans other than the standard choice of current OBDA implementations can lead to major improvements in the performance of query answering. Moreover, they seem to confirm that our cost model is able to estimate which plan is best.
Documents in UNITESI are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/20.500.14242/250413
URN:NBN:IT:UNIBZ-250413