Ling Liu
Dept. of Computer Science and Engineering
Oregon Graduate Institute
P.O.Box 91000 Portland, Oregon
97291-1000 USA
email: lingliu@cse.ogi.edu
Calton Pu
Dept. of Computer Science and Engineering
Oregon Graduate Institute
P.O.Box 91000 Portland, Oregon
97291-1000 USA
email: calton@cse.ogi.edu
We describe a rule-based approach to semantic specification that can be used to dynamically establish semantic relevance between consumers' queries posed on the fly and producers' data sources available online. The semantic specification of user queries is independent of the content and capabilities of the data sources (query-to-source independence). The semantic specification of the content and capability of a data source is independent of the content and capability descriptions of other sources (source-to-source independence). Query processing techniques use these specifications along with the query modification and transformation routines to perform relevance reasoning and to guarantee semantically correct query answers. This work also examines the effect of changing metadata semantics, such as changes in the content and capability descriptions of data sources or the user query profile specifications. Methods are described for detecting these changes and for determining whether the information sources can continue to supply meaningful data for answering consumers' queries. These methods and transformation routines are necessary for determining logical connectivity between information producers and information consumers and for establishing dynamic interconnection between a growing collection of data sources and a consumer's query.
Thus, queries to current WWW search tools are mostly specified by simply typing in keywords; the search tools then handle the request and find the sources that match the given keywords. However, scalability and dynamic interconnection between producers and consumers are achieved at the price of query effectiveness, namely the quality and responsiveness of the answers, for several reasons. First, responses returned by WWW search tools often contain too much irrelevant information (noise). Second, queries in network-centric information systems are more vulnerable to failure due to congestion of the networks, traffic at the intermediate sites, and contention at the sources. As a result, it frequently happens that a user who needs information from multiple data sources is unable to retrieve and fuse that information in a timely fashion.
From the data management point of view, the main complications in providing quality access in an open environment such as the Internet are the following:
To deal with these problems, we believe that a global information system must allow consumers to pose queries on the fly, in the sense that users can issue queries without knowing the structure, location, or existence of the requested data. It is no longer realistic to maintain a pre-defined integrated world (global) view of all data sources, due to the dynamic nature of the data sources online and the diversified business objectives of information consumers. Therefore, to provide scalable and high-quality query services for accessing heterogeneous data sources, a distributed interoperable information system must have (1) enough information about the content and query capability of the data sources available online, namely source capability profiles (see definition in Section 4), to understand and exploit the relationships between their contents, and (2) a good knowledge of the semantic scope of the user query domain and the context of what users want to receive from a query, namely user query profiles (see definition in Section 4), to find and relate the users' queries with the subset of data sources that actually contribute to answering the query.
Most importantly, we must capture the content and capability of each data source independently of other sources, thus providing source-to-source independent metadata management to obtain high scalability of the global information system. Furthermore, we must allow information consumers to pose queries independently of the list of available data sources and their source-specific data representations, thus providing query-to-source data independence and the ability to incorporate new information sources into the answering of existing queries dynamically and seamlessly.
This research examines the specification and use of metadata in a simple consumer-producer model. Information consumers pose their queries without requiring any knowledge of data sources, and user query profiles are created and maintained to capture the semantic annotation or changing requirements of the consumer queries. Information producers' data sources are described through source capability profiles, which capture metadata about the sources such as their content, category, query capability, and access rights. We describe a rule-based representation language for both the source capability profile specification and the user query profile specification. We examine query processing strategies that use this semantic representation language to determine whether a data source can provide the consumer query or application with meaningful data. In particular, we describe the role of metadata in query routing, an important dynamic query processing technique in distributed open environments. The methods proposed for dynamic interconnection of consumers' queries with producers' sources allow for changes in the content, number, location, and connectivity of the sources or changes in the consumers' query requirements.
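The idea of routing a query by matching a user query profile against independently described source capability profiles can be illustrated with a minimal sketch. This is not the rule-based language of the paper: the class names, fields, the `is_relevant` test, and the sample sources below are all hypothetical, and the relevance rule (same category, at least one requested attribute offered, all requested operations supported) is a simplifying assumption chosen only to make the example concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceCapabilityProfile:
    """Hypothetical description of one source, written without reference to any other source."""
    name: str
    categories: frozenset   # subject categories the source covers
    attributes: frozenset   # attributes the source can supply
    operations: frozenset   # query operations the source supports


@dataclass(frozen=True)
class UserQueryProfile:
    """Hypothetical description of what the consumer wants, written without naming any source."""
    category: str
    attributes: frozenset
    operations: frozenset


def is_relevant(query, source):
    # Assumed relevance rule: the source covers the query's category,
    # offers at least one requested attribute, and supports every
    # operation the query needs.
    return (
        query.category in source.categories
        and bool(query.attributes & source.attributes)
        and query.operations <= source.operations
    )


def route_query(query, sources):
    # Query routing: keep only the sources that may contribute to the answer.
    return [s.name for s in sources if is_relevant(query, s)]


# Illustrative data only; none of these sources come from the paper.
sources = [
    SourceCapabilityProfile("travel_db", frozenset({"travel"}),
                            frozenset({"destination", "price"}), frozenset({"select"})),
    SourceCapabilityProfile("weather_feed", frozenset({"weather"}),
                            frozenset({"city", "temperature"}), frozenset({"select"})),
    SourceCapabilityProfile("airline_site", frozenset({"travel"}),
                            frozenset({"flight", "price"}), frozenset({"keyword"})),
]
query = UserQueryProfile("travel", frozenset({"price"}), frozenset({"select"}))

print(route_query(query, sources))
```

Because each profile is self-contained, a new source can be appended to `sources` and picked up by the next call to `route_query` without touching the query profile or the other source profiles, which is the kind of query-to-source and source-to-source independence the text argues for.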
The rest of the paper is organized as follows. We first examine related work in the area of metadata management. Then we use a motivating example to describe problems of naive keyword based search and the role of metadata in improving query responsiveness. Following the background and motivation of our work, we introduce the metadata description model for describing user query profiles and source capability profiles. We describe the use of our metadata model and in particular, how the user query profiles are used in conjunction with the source capability profiles to determine the relevance of the sources to a particular user query and to identify and resolve the semantic conflicts between consumers' query representation and the data sources. We also examine the use of our metadata model in a dynamic environment where changes occur in either consumers' query specifications or producers' source capability descriptions. Finally we present our conclusion and describe the areas of future research including the use of metadata in semantic reconciliation and query result packaging and assembly.
Metadata refers to data about the meaning, content, organization, or purpose of data. Metadata may be as simple as a relational schema or as complicated as a class library or information describing the derivation, accuracy, and history of individual data items.
In [10], Sciore, Siegel and Rosenthal describe a metadata approach to facilitate interoperability among heterogeneous information systems. Their approach uses semantic values in the context of the relational model and provides transparent context conversions and manipulations of metadata, together with a context mediator to capture the context-related meta information. In [8], McCarthy presents a metadata representation language. It allows the inclusion of a wide range of metadata accessible through a set of operators specially defined for metadata manipulation. The work in [3] uses a knowledge-based representation for metadata. However, none of these methods discusses practical means for defining comparable concepts and relating concepts at the schema level and data level to facilitate the processing of queries over heterogeneous data sources.
We intend to use metadata to address the following questions: