
1 Introduction

The rapid growth of World Wide Web (WWW) technology and the increase in network speed and bandwidth have challenged the way we think and the way we obtain and exchange information. Today anyone can publish information on the web independently at any time. The flexibility and autonomy of producing and sharing information on the WWW is phenomenal. On the other hand, one has to learn to deal with the rapid increase in the volume and diversity of online information and the constant changes of information sources in number, content, and location.

Queries to current WWW search tools are mostly specified by simply typing in keywords; the search tools handle the request and find the sources that match the given keywords. However, the scalability and the dynamic interconnection between producers and consumers are achieved at the price of query effectiveness, namely the quality and the responsiveness of the answers, for several reasons. First, responses returned by WWW search tools often contain too much irrelevant information (noise). Second, queries in network-centric information systems are more vulnerable to failure due to congestion of the networks, traffic at the intermediate sites, and contention at the sources. As a result, it frequently happens that one needs information from multiple data sources but is unable to obtain and fuse that information in a timely fashion.

From the data management point of view, the main complications in providing quality access in an open environment such as the Internet are the following:

Data sources are autonomous and heterogeneous in nature, and the number of sources available online is constantly growing. For example, different book stores or publishing companies may use different data formats for their book databases, and many online virtual bookstores are available on the net.
Information stored in autonomous data sources is often interrelated and possibly replicated. For example, a book can be purchased from multiple bookstores or book clubs.
Most data sources contain incomplete information. For example, very few book stores or publishers can provide all books on cancer.
Many data sources are not full-featured database systems and can answer only a small set of queries over their data.
Over time, not only may new data sources or applications need to be added to an already heterogeneous mix, but existing data sources may also change the specifications of the data they provide, and consumers may change the requirements for the data they request.

To deal with these problems, we believe that a global information system must allow consumers to pose queries on the fly, in the sense that users can issue queries without knowing the structure, location, or existence of the requested data. It is no longer realistic to maintain a pre-defined integrated world (global) view of all data sources, because the data sources online are dynamic and the business objectives of information consumers are diverse. Therefore, to provide scalable and high quality query services for accessing heterogeneous data sources, a distributed interoperable information system must have (1) enough information about the content and query capability of the data sources available online, namely source capability profiles (see definition in Section 4), to understand and exploit the relationships between their contents, and (2) a good knowledge of the semantic scope of the user query domain and the context of what users want to receive from a query, namely user query profiles (see definition in Section 4), to find and relate the users' queries to the subset of data sources that actually contribute to answering the query.

Most importantly, we must capture the content and capability of each data source independently of other sources, thus providing source-to-source independent metadata management and, with it, high scalability of the global information system. Furthermore, we must allow information consumers to pose queries independently of the list of available data sources and their source-specific data representations, thus providing query-to-source data independence and the ability to incorporate new information sources into the answering of existing queries dynamically and seamlessly.

This research examines the specification and use of metadata in a simple consumer-producer model. Information consumers pose their queries without requiring any knowledge of data sources, and user query profiles are created and maintained to capture the semantic annotation or changing requirements of the consumer queries. Information producers' data sources are described through source capability profiles, which capture metadata about the sources such as their content, category, query capability, and access rights. We describe a rule-based representation language for both the source capability profile specification and the user query profile specification. We examine query processing strategies that use this semantic representation language to determine whether a data source can provide the consumer query or application with meaningful data. In particular, we describe the role of metadata in query routing, an important dynamic query processing technique in distributed open environments. The methods proposed for dynamic interconnection of consumers' queries with producers' sources allow for changes in the content, number, location, and connectivity of the sources, as well as changes in the consumers' query requirements.
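To make the idea of metadata-driven query routing concrete, the following is a minimal Python sketch. It is an illustration under our own simplifying assumptions, not the rule-based representation language itself: the class names, the fields (categories, queryable, conditions), the example sources, and the matching rule are all hypothetical. A source is kept for a query only if its content overlaps the query's scope and it can evaluate every condition the query uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceCapabilityProfile:
    """Hypothetical producer-side metadata: what a source holds and can answer."""
    name: str
    categories: frozenset  # content categories, e.g. {"books"}
    queryable: frozenset   # attributes the source accepts as search conditions


@dataclass(frozen=True)
class UserQueryProfile:
    """Hypothetical consumer-side metadata: the semantic scope of one query."""
    categories: frozenset  # domains the consumer is interested in
    conditions: frozenset  # attributes the query places conditions on


def route(query, sources):
    """Return the names of the sources worth contacting for this query.

    Assumed rule: the source's content must overlap the query's scope, and
    the source must be able to evaluate every condition in the query.
    """
    return [
        s.name
        for s in sources
        if s.categories & query.categories and query.conditions <= s.queryable
    ]


sources = [
    SourceCapabilityProfile("bookstore_a", frozenset({"books"}),
                            frozenset({"title", "author", "subject"})),
    SourceCapabilityProfile("publisher_b", frozenset({"books"}),
                            frozenset({"title"})),
    SourceCapabilityProfile("weather_c", frozenset({"weather"}),
                            frozenset({"city"})),
]
query = UserQueryProfile(frozenset({"books"}), frozenset({"subject"}))

print(route(query, sources))  # ['bookstore_a']

# A new source is described on its own; no existing profile or query changes.
sources.append(
    SourceCapabilityProfile("bookclub_d", frozenset({"books", "medicine"}),
                            frozenset({"subject", "price"}))
)
print(route(query, sources))  # ['bookstore_a', 'bookclub_d']
```

In this sketch, publisher_b is pruned because it cannot search by subject, and weather_c is pruned because its content is outside the query's scope. Because each profile is written without reference to any other source, adding bookclub_d changes the answer to the existing query without touching the query or the other profiles.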

The rest of the paper is organized as follows. We first examine related work in the area of metadata management. Then we use a motivating example to describe the problems of naive keyword-based search and the role of metadata in improving query responsiveness. Following the background and motivation of our work, we introduce the metadata description model for describing user query profiles and source capability profiles. We describe the use of our metadata model and, in particular, how the user query profiles are used in conjunction with the source capability profiles to determine the relevance of the sources to a particular user query and to identify and resolve the semantic conflicts between consumers' query representations and the data sources. We also examine the use of our metadata model in a dynamic environment where changes occur in either consumers' query specifications or producers' source capability descriptions. Finally, we present our conclusions and describe areas of future research, including the use of metadata in semantic reconciliation and in query result packaging and assembly.


Ling Liu
Tue Jun 17 15:26:27 PDT 1997