MANAGING UNCERTAINTY AND ONTOLOGIES IN DATABASES

Postgraduate

ABSTRACT

Nowadays a vast amount of data is generated in Extensible Markup Language (XML). However, it is necessary for applications in some domains to store and manipulate uncertain information, e.g. when the sensor inputs are noisy, or we want to store data that is uncertain. Another big change we can see in applications and web data is the increasing use of ontologies to describe the semantics of data, i.e., the semantic relationships between the terms in the databases.

As such information is usually absent from traditional databases, there is tremendous opportunity to ask new kinds of queries that could not be handled in the past. This provides new challenges on how to manipulate and maintain such new kinds of database systems.

In this dissertation, we will see how we can (i) incorporate and manipulate uncertainty in databases, and (ii) efficiently compute aggregates and maintain views on ontology databases.

First, I explain applications that require manipulating uncertain information in XML databases and maintaining web ontology databases written in Resource Description Framework (RDF).

I then introduce the probabilistic semistructured PXML data model with two formal semantics. I describe a set of algebraic operations and its efficient implementation. Aggregations of PXML instances are studied with two semantics proposed: possible-worlds semantics and expectation semantics. Efficient algorithms with pruning are given and evaluated to show their feasibility. I introduce PIXML, an interval probability version of PXML, and develop a formal semantics for it. A query language and its operational semantics are given and proved to be sound and complete.

Based on XML, RDF is a language used to describe web ontologies. RDQL, an RDF query language, is extended to support view definition and aggregations. Two sets of algorithms are given to maintain non-aggregate and aggregate views. Experimental results show that they are efficient compared with standard relational view maintenance algorithms.

Introduction

New Challenges in XML Databases

Over the last few years, there has been considerable interest in Extensible Markup Language (XML) databases. A proliferation of semistructured data models have been proposed, along with associated query languages [3, 18] and algebras. XML is a simple but very flexible markup language derived from SGML, which is now mainly used for exchange, transmission and manipulation of data on the web. XML tags are not predefined, which means that it gives users flexibility to define their own tags according to their domains and applications.

Nowadays, while data is more often generated in XML for easier data transmission and manipulation over the web, it is necessary for applications in some domains to store and manipulate uncertain information. For example, this occurs when the sensor inputs are noisy. Another big change we can see in applications and web data is the increasing use of ontologies to describe the semantics of data, i.e., the semantic relationships between the terms stored in the databases. As such information is usually absent from traditional databases, there is tremendous opportunity to ask new kinds of queries that could not be handled in the past. This provides new challenges on how to manipulate and maintain such new kinds of database systems. In this dissertation, I will describe how we can (i) incorporate and manipulate uncertain information in databases, and (ii) efficiently compute aggregates and maintain views on ontology databases.

In the following sections, I will briefly introduce the motivations underlying the above two problems, my contributions and the organization of this dissertation.


Uncertainty in XML

The semistructured data model has the advantage of not placing hard constraints on the structure of the data. However, a particular semistructured instance specifies deterministic relationships between objects. In cases where we would also like to avoid hard constraints on the objectlevel structure, it is desirable to have a model that allows us to represent uncertainty over the relationships between objects in the semistructured model. This uncertainty is necessary when relationships between objects and values for attributes of objects are not known with absolute certainty.

There are numerous applications (including financial, image processing, manufacturing and bioinformatics) for which a probabilistic XML data model is quite natural and for which a query language that supports probabilistic inference provides important functionality. Probabilistic inference supports capabilities for predictive and ‘what-if’ types of analysis. For example, consider the use of a variety of predictive programs for the stock market. Such programs usually return probabilistic information. If a company wanted to export this data into an XML format, they would need methods to store probabilistic data in XML. The financial marketplace is a hotbed of both predictive and XML activity (e.g. the FIX standard for financial data is XML based). There is the same need to store probabilistic data in XML for programs that predict expected energy usage and cost, expected failure rates for machine parts, and in general, for any predictive program. Another useful class of applications where there is a need for probabilistic XML data is image processing programs that process images (automatically) using image identification methods and store the results in an XML database. Such image processing algorithms often use statistical classifiers and often yield uncertain data as output. If such information is to be stored in an XML database, then it would be very useful to have the ability to automatically query this uncertain information. Another important application is in automated manufacturing monitoring and diagnosis. A corporate manufacturing floor may use sensors to track what happens on the manufacturing floor. The results of the sensor readings may be automatically piped to a fault diagnosis program that may identify zero, one, or many possible faults with a variety of probabilities on the space of faults. When such analysis is stored in a database, there is a natural need for probabilities. In addition to these types of applications, information extraction using probabilistic parsing of input sources may also result in a semistructured instance in which there is uncertainty. For example, the NSIR system for searching documents at the University of Michigan returns documents based along with probabilities. Likewise, Nierman and Jagadish. point out the use of probabilistic semistructured databases in protein chemistry.

While there has been a great deal of work on supporting uncertainty in relational models, to date, there has been little work on supporting uncertainty in semistructured models. There are a few exceptions including and. Dekhtyar et al. Proposed a model that allows probabilistic information to be stored using semistructured databases. My proposal does the opposite: I extend the semistructured data model so that paths in such a model can include probabilistic information. More closely related to my work is the work of Nierman and Jagadish, in which a tree-structured probabilistic database is proposed. I will show that their model is a special case of my probabilistic semistructured PXML model.