A Metadata Based Approach to Improving
Query Responsiveness


Ling Liu
Dept. of Computer Science and Engineering
Oregon Graduate Institute
P.O.Box 91000 Portland, Oregon
97291-1000 USA
email: lingliu@cse.ogi.edu

Calton Pu
Dept. of Computer Science and Engineering
Oregon Graduate Institute
P.O.Box 91000 Portland, Oregon
97291-1000 USA
email: calton@cse.ogi.edu


Abstract:

We describe a rule-based approach to semantic specification that can be used to dynamically establish semantic relevance between consumers' queries posed on the fly and producers' data sources available online. The semantic specification of user queries is independent of the content and capabilities of the data sources (query-to-source independence). The semantic specification of the content and capability of a data source is independent of the content and capability descriptions of other sources (source-to-source independence). Query processing techniques use these specifications along with the query modification and transformation routines to perform relevance reasoning and to guarantee semantically correct query answers. This work also examines the effect of changing metadata semantics, such as changes in the content and capability descriptions of data sources or in the user query profile specifications. Methods are described for detecting these changes and for determining whether the information sources can continue to supply meaningful data for answering consumers' queries. These methods and transformation routines are necessary for determining logical connectivity between information producers and information consumers and for establishing dynamic interconnection between a growing collection of data sources and a consumer's query.


Copyright 1997 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

1 Introduction

  The rapid growth of World Wide Web (WWW) technology and the increase in network speed and bandwidth have challenged the way we think and the way we obtain and exchange information. Everyone today can publish information on the web independently at any time. The flexibility and autonomy of producing and sharing information on the WWW is phenomenal. On the other hand, one has to learn to deal with the rapid increase in the volume and diversity of online information and the constant changes of information sources in number, content, and location.

Thus, queries to the current WWW search tools are mostly specified by simply typing in keywords; the search tools handle the request and find the sources that match the given keywords. However, the scalability and the dynamic interconnection between producers and consumers are achieved at the price of the effectiveness of queries, namely the quality and the responsiveness of the answers, for several reasons. First, responses returned by WWW search tools often contain too much irrelevant information (noise). Second, queries in network-centric information systems are more vulnerable to failure due to congestion of the networks, traffic at the intermediate sites, and contention at the sources. As a result, it frequently happens that one needs information from multiple data sources but is unable to get and fuse that information in a timely fashion.

From the data management point of view, the main complications in providing quality access in an open environment such as the Internet are the following:

  1. Data sources are autonomous and heterogeneous in nature, and the number of sources available online is constantly growing. For example, different book stores or publishing companies may use different data formats for their book databases, and many online virtual bookstores are available on the net.
  2. Information stored in autonomous data sources is often interrelated and possibly replicated. For example, a book can be purchased from multiple bookstores or book clubs.
  3. Most data sources contain incomplete information. For example, very few book stores or publishers can provide all books on cancer.
  4. Many data sources are not full-featured database systems and can answer only a small set of queries over their data.
  5. Over time, not only may new data sources or applications need to be added to an already heterogeneous mix, but existing data sources may also change the specifications of the data they provide, and consumers may change the requirements for the data they request.

To deal with these problems, we believe that a global information system must allow consumers to pose queries on the fly, in the sense that users can issue queries without knowing the structure, location, or existence of the requested data. It is no longer realistic to maintain a pre-defined integrated world (global) view of all data sources, due to the dynamic nature of the data sources online and the diversified business objectives of information consumers. Therefore, to provide scalable and high quality query services for accessing heterogeneous data sources, a distributed interoperable information system must have (1) enough information about the content and query capability of the data sources available online, namely source capability profiles (see definition in Section 4), to understand and exploit the relationships between their contents, and (2) a good knowledge of the semantic scope of the user query domain and the context of what users want to receive from a query, namely user query profiles (see definition in Section 4), to find and relate the users' queries with the subset of data sources that actually contribute to answering the query.

Most importantly, we must capture the content and capability of each data source independently of other sources, thus providing source-to-source independent metadata management to obtain high scalability of the global information system. Furthermore, we must allow information consumers to pose queries independently of the list of available data sources and their source-specific data representations, thus providing query-to-source data independence and the ability to incorporate new information sources into the answering of existing queries dynamically and seamlessly.

This research examines the specification and use of metadata in a simple consumer-producer model. Information consumers pose their queries without requiring any knowledge of data sources, and user query profiles are created and maintained to capture the semantic annotations and changing requirements of the consumer queries. Information producers' data sources are described through source capability profiles, which capture metadata about the sources such as their content, category, query capability, and access rights. We describe a rule-based representation language for both the source capability profile specification and the user query profile specification. We examine query processing strategies that use this semantic representation language to determine whether a data source can provide the consumer query or application with meaningful data. In particular, we describe the role of metadata in query routing, an important dynamic query processing technique in distributed open environments. The methods proposed for dynamic interconnection of consumers' queries with producers' sources allow for changes in the content, number, location, and connectivity of the sources and for changes in the consumers' query requirements.
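The consumer-producer model outlined above can be illustrated with a minimal Python sketch. It is not the rule-based representation language of this paper; the class names SourceProfile and QueryProfile, their fields, and the relevance test in route are all assumptions chosen for illustration. The sketch assumes a source is relevant when its category matches the query's category and it supplies at least one requested attribute.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class SourceProfile:
    """Hypothetical source capability profile, described without reference to other sources."""
    name: str
    category: str               # e.g. "books"
    attributes: FrozenSet[str]  # attributes the source is able to return


@dataclass(frozen=True)
class QueryProfile:
    """Hypothetical user query profile; it names no data source."""
    category: str
    targets: FrozenSet[str]     # attributes the consumer asks for


def route(query: QueryProfile, sources: List[SourceProfile]) -> List[str]:
    """Return the names of sources that may contribute to the query.

    Assumed relevance test: same category and at least one requested
    attribute in common. A source need not cover the whole query.
    """
    relevant = []
    for source in sources:
        if source.category != query.category:
            continue
        if source.attributes & query.targets:
            relevant.append(source.name)
    return relevant


# Illustrative data only; none of these sources come from the paper.
bookstore = SourceProfile("bookstore", "books",
                          frozenset({"title", "price", "supplier", "year"}))
review_site = SourceProfile("review_site", "books",
                            frozenset({"title", "review"}))
car_dealer = SourceProfile("car_dealer", "cars",
                           frozenset({"model", "price"}))

query = QueryProfile("books",
                     frozenset({"title", "price", "supplier", "review"}))

routed = route(query, [bookstore, review_site, car_dealer])

# A newly registered source is picked up without touching the query.
book_club = SourceProfile("book_club", "books", frozenset({"title", "price"}))
routed_after_growth = route(query, [bookstore, review_site, car_dealer, book_club])
```

In this sketch each profile is written in isolation, which mirrors source-to-source independence, and the query profile is unchanged when book_club is added, which mirrors query-to-source independence. A fuller treatment would also have to reason about query capabilities and semantic conflicts, which this sketch leaves out.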

The rest of the paper is organized as follows. We first examine related work in the area of metadata management. Then we use a motivating example to describe problems of naive keyword based search and the role of metadata in improving query responsiveness. Following the background and motivation of our work, we introduce the metadata description model for describing user query profiles and source capability profiles. We describe the use of our metadata model and in particular, how the user query profiles are used in conjunction with the source capability profiles to determine the relevance of the sources to a particular user query and to identify and resolve the semantic conflicts between consumers' query representation and the data sources. We also examine the use of our metadata model in a dynamic environment where changes occur in either consumers' query specifications or producers' source capability descriptions. Finally we present our conclusion and describe the areas of future research including the use of metadata in semantic reconciliation and query result packaging and assembly.


2 Metadata

Metadata refers to data about the meaning, content, organization, or purpose of data. Metadata may be as simple as a relational schema or as complicated as a class library or information describing the derivation, accuracy, and history of individual data items.

In [10], Sciore, Siegel and Rosenthal describe a metadata approach to facilitate interoperability among heterogeneous information systems. Their approach uses semantic values in the context of the relational model and provides transparent context conversions and manipulations of metadata, together with a context mediator to capture the context-related meta information. In [8], McCarthy presents a metadata representation language. It allows the inclusion of a wide range of metadata accessible through a set of operators specially defined for metadata manipulation. [3] uses a knowledge-based representation for metadata. However, none of these methods discusses practical means for defining comparable concepts and relating concepts at the schema level and the data level to facilitate the processing of queries over heterogeneous data sources.

We intend to use metadata to address the following questions:

  1. How can we find the set of information sources that are relevant and are semantically meaningful to answering a user query in an open environment?
  2. Can the way a user query is posed be independent of changes in the information source content description?
  3. Can the source content and capability description be independent of changes in the semantics of other information sources?


3 A Motivating Example

  Example 1 Suppose we want to search for the title, price, supplier, and reviews of books about cancer published in 1996, so that we can look at the reviews and compare the prices before making a purchase decision. The parameters of interest to this query are the price, year, title, supplier name, and book reviews. We may ask query Q: get the title, price, supplier, and reviews of books about cancer that were published in 1996. We may enter the query using an SQL-like on-line form (e.g., the DIOM Interface Query Language, IQL [7]) as follows: