Overview
Motivation
The Internet today contains millions of hosts that provide useful information accessible from anywhere in the world. Many of these information sources partially overlap, and their querying capabilities are varying yet limited.
Locating and accessing information in a rapidly growing, heterogeneous, distributed collection of data sources such as the Internet is a difficult problem of growing importance, especially for large-scale data-intensive systems where a single point of contact is desired. However, neither the organization of the data nor the available Internet tools for associative access to the data are adequate for quickly discovering relevant information.
Query Routing
Query routing is the process of directing user queries to appropriate servers by constraining the search space through query refinement and source selection. The goal of a query routing middleware is to reduce both false positives (useless answers delivered that fail to fulfill the user's needs) and false negatives (useful answers that the system fails to deliver to the user) in answering a query. Effective query routing not only minimizes the query response time and the overall processing cost, but also eliminates unnecessary communication overhead across the global network and at the individual information sources.
The goal of a query routing system is to provide efficient associative access to a large, heterogeneous, distributed collection of information providers by routing a user query to the most relevant information sources, namely those that can provide the best answer. We pursue this goal along two dimensions. The first is to develop a set of methods and techniques that incorporate the query routing step into the query evaluation and search process, improving query responsiveness and system scalability. The second is to build a working system that demonstrates the ideas, concepts, and techniques we developed for efficient and responsive querying and search, using real-world application scenarios.
AQR Software
AQR is an adaptive query routing middleware system.
In AQR, a query routing task is divided into two cooperating processes:
query refinement and source selection.
It is well known that a broadly defined query inevitably produces many false positives. Query refinement is a process in which various refinement mechanisms are used to narrow a query's definition so that it focuses on useful answers. As a complementary process, source selection reduces false negatives by identifying and locating a set of relevant information providers from a large collection of available sources. By pruning the information sources that are irrelevant to a user query, source selection also reduces the overhead of contacting information servers that do not contribute to the answer of the query.
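Conceptually, the two processes compose into a single routing step. The Python sketch below shows this composition with placeholder callables; refine and select_sources stand in for the concrete mechanisms described in the following sections and are not AQR's actual API.

```python
def route(query, sources, refine, select_sources):
    """One routing step: refine the query first (cutting false
    positives), then select sources (cutting false negatives).
    Both callables are placeholders for concrete mechanisms."""
    focused = refine(query)                      # query refinement
    relevant = select_sources(focused, sources)  # source selection
    return focused, relevant
```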
The AQR Software serves as an adaptive middleware layer that incorporates several different query routing mechanisms and provides an architecture for deploying them in an open networked environment. AQR also includes facilities for performance monitoring, which allow a system developer to examine the impact of using different query routing mechanisms.
Query Refinement
In a large, rapidly evolving network of information servers, users will inevitably submit broadly defined queries that produce enormous result sets with many false positives. Such result sets are likely to degrade system performance and overwhelm the user with unwanted material. Query refinement is a value-added functionality that helps the user formulate queries that return more useful results and that can be processed efficiently.
Query refinement can be seen as a mechanism that recommends query modifications to reduce false positives. For instance, query refinement can assist a user in formulating well-focused queries by suggesting terms related to the query. These terms help the user narrow the query definition toward documents of interest, either by formulating new queries or by sharpening the existing one.
Query refinement drives this focusing process incrementally, iteratively offering the user an automatically generated set of terms that can be chosen to make the query more specific.
There are a number of methods for deriving recommended terms to add to a user's query. Typical query refinement algorithms rely on term collocation. A commonly used approach consists of two basic steps: first compute the set of documents that contain one or more terms from the user's query, then suggest to the user the terms with the highest cumulative frequency over that set of documents. A concrete query refinement algorithm may incrementally drive the refinement process by iterating these two steps. We refer to this approach as source-document-based term collocation.
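As a rough illustration, the following sketch implements these two steps over an in-memory collection; the tokenized-document representation and the top_k cutoff are assumptions for the example, not part of AQR.

```python
from collections import Counter

def collocated_terms(query_terms, documents, top_k=5):
    """Suggest refinement terms by source-document term collocation.

    documents: an in-memory stand-in, one list of tokens per document.
    """
    query = set(query_terms)
    # Step 1: documents containing one or more of the query terms.
    matching = [doc for doc in documents if query & set(doc)]
    # Step 2: rank non-query terms by cumulative frequency over that set.
    freq = Counter(t for doc in matching for t in doc if t not in query)
    return [term for term, _ in freq.most_common(top_k)]
```

Iterating these two steps, with the user's chosen suggestions folded back into query_terms, yields the incremental refinement loop described above.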
In AQR, a new approach to query refinement is explored, which uses user query profiles as a means to assist a user in formulating well-focused queries. The main idea is to derive recommended terms based on the semantic context and scope of what the user wants in a particular query, and to replace terms that are too broad in the original query definition with the recommended terms. For instance, in response to the query "book suppliers", the query routing system will derive the following recommended terms: "book store", "book club", "publisher". These recommended terms are either derived automatically from a domain-specific ontology (e.g., an ontology of the book application domain in this case) or obtained directly from the user's feedback on the query context. We refer to our approach to finding term collocations as ontology-based term collocation.
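A minimal sketch of this idea follows, assuming the ontology is available as a simple mapping from broad terms to narrower related terms (using the book domain from the example above); a real ontology and AQR's derivation logic are of course richer.

```python
# Hypothetical fragment of a domain-specific ontology, mapping broad
# terms to narrower related terms (book domain from the example).
BOOK_ONTOLOGY = {
    "book suppliers": ["book store", "book club", "publisher"],
}

def ontology_refinements(query_terms, ontology):
    """Recommend narrower replacement terms for overly broad query terms."""
    return {t: ontology[t] for t in query_terms if t in ontology}

# ontology_refinements(["book suppliers"], BOOK_ONTOLOGY)
# -> {"book suppliers": ["book store", "book club", "publisher"]}
```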
A significant difference between the user-query-profile approach to query refinement and the approach based on collocation of terms in the source documents lies in the way additional terms are derived. The query refinement approach driven by user query profiles computes and recommends terms to focus a query primarily based on the domain knowledge of the terms used in the original user query, and is thus independent of the collection of source documents over which the query is posed.
An obvious advantage of the ontology-based term collocation approach is its ability to reduce false positives before the query is actually run, thus enhancing the efficiency and accuracy of the query refinement algorithms.
Source Selection
Source selection is the process in query routing that aims to reduce or minimize false negatives in the result set of a query. It helps the user locate relevant information by identifying and pruning the information sources that are irrelevant to a user query, thus reducing the overhead of contacting information servers that do not contribute to the answer of the query.
Given a query and a distributed collection of information sources, the first decision one needs to make is which of the following four different search semantics
is required for this specific query:
- Exhaustive Search - The user is interested in the set of all information sources that contain any matching documents.
- All-Best Search - The user is interested in all of the best data sources for the query. The best data sources are those that contain more matching documents than any other data source and have the highest payoff. In this case, the user is willing to miss some data sources, as long as they are not the best ones. Usually, a threshold-based approach is used to qualify the best sources.
- Sample-Best Search - The user is interested in searching only some of the best data sources, namely those that yield the highest payoff, because of limited resources (e.g., time, budget).
- Sample Search - The user is in "browsing" mode and simply wants to obtain some matching documents (normally with the shortest response time).
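These four semantics can be read as different selection rules over payoff-ranked sources. The sketch below illustrates one plausible encoding; the threshold and k parameters, and the payoff scores themselves, are assumptions for the example rather than AQR's definitions.

```python
from enum import Enum, auto

class SearchSemantics(Enum):
    EXHAUSTIVE = auto()    # every source with any match
    ALL_BEST = auto()      # every source above a payoff threshold
    SAMPLE_BEST = auto()   # only the top-k sources by payoff
    SAMPLE = auto()        # any few matching sources

def pick_sources(scored, semantics, threshold=0.5, k=3):
    """scored: list of (source, estimated_payoff) pairs, payoff in [0, 1]."""
    ranked = sorted(scored, key=lambda sp: sp[1], reverse=True)
    if semantics is SearchSemantics.EXHAUSTIVE:
        return [s for s, p in ranked if p > 0]
    if semantics is SearchSemantics.ALL_BEST:
        return [s for s, p in ranked if p >= threshold]
    if semantics is SearchSemantics.SAMPLE_BEST:
        return [s for s, _ in ranked[:k]]
    return [s for s, p in ranked if p > 0][:k]   # SAMPLE
```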
Given a query and its search semantics, there are a number of alternative approaches to compute the best data sources for a particular query, depending on the types and the capabilities of the information sources.
- Use the number of source documents and the number of matching documents as a measure of server relevance.
For instance, one may use a histogram mechanism based on a probabilistic scheme, which characterizes the contents of each information server by extracting a histogram of its word occurrences. The histograms are then used to estimate the result size of each query, and the data sources with the highest payoff (i.e., the largest number of matching documents) are chosen as the best information servers to search under the All-Best and Sample-Best semantics (a sketch of this estimation appears after this list). The main problem with this approach is twofold. On one hand, the estimate of word occurrences may not be accurate at all, especially for sources whose documents evolve. On the other hand, without explicit query refinement, a broad definition of the query terms in the original query may inevitably produce many false positives, invalidating some of the source selection results.
- Compute the best data sources for a query using content summaries and query capabilities of data sources as a measure of server relevance.
This approach is popular in mediator-based systems. It characterizes data sources by their content and capability descriptions. Such source capability descriptions are then used to evaluate the relevance of a source to a particular query: the data sources that have the capability to handle the query and would return non-empty results are chosen as the best information servers for executing it. The capability-based approach has two limitations. First, most existing proposals for capability-based source selection rely on manual collection of query capabilities and content descriptions. When handling a large number of data sources, manually creating and maintaining source capability profiles does not scale well. Second, most existing proposals for the capability-based approach suffer from the absence of query refinement, so even accurate capability profiles of the data sources may not guarantee the best routing results. For instance, when a broad definition of the query terms is used in the original query, query routing without query refinement may unavoidably produce many false positives.
- Use source information quality measurements as the criteria for source selection.
The most commonly used information quality dimensions include the accuracy, reputation, timeliness, completeness, and understandability of the sources. A main drawback of this approach is that most of these information quality measurements are user-specific and thus subjective. The quality of source selection drops dramatically when the querying users do not share the given set of information quality measurements.
- The AQR approach -
In AQR, we develop a framework for building an adaptive query routing system. Our approach has two important features. First, we use both the user query profile and the source capability profile as measures of server relevance. User query profiles drive query refinement, producing a well-focused query so that the selected data sources yield zero or minimal false positives; source capability profiles, combined with a multi-level progressive pruning algorithm, ensure that the result of source selection has zero or minimal false negatives. Second, we allow different options for query refinement and source selection to be mixed and matched within a single application, and we provide performance monitoring so that a system developer can examine the impact of using different query routing mechanisms.
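To make the first approach in the list above concrete, here is a rough sketch of histogram-based result-size estimation. It assumes word occurrences are independent across documents, a common simplification; the data structures are illustrative and are not AQR's actual scheme.

```python
def estimate_matches(histogram, total_docs, query_terms):
    """Estimate how many documents in a source match all query terms.

    histogram: dict mapping word -> number of documents containing it.
    Assumes words occur independently in documents (a simplification).
    """
    if total_docs == 0:
        return 0.0
    p = 1.0
    for term in query_terms:
        p *= histogram.get(term, 0) / total_docs
    return p * total_docs

def rank_by_payoff(sources, query_terms):
    """sources: list of (name, histogram, total_docs) triples; returns
    sources ordered by estimated result size, largest first."""
    scored = [(name, estimate_matches(h, n, query_terms))
              for name, h, n in sources]
    return sorted(scored, key=lambda sp: sp[1], reverse=True)
```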
It is important to note that different source selection mechanisms use different structures for their source capability profiles and require different pruning algorithms.
Multi-level Source Relevance Reasoning
- Step 1: Level-one relevance pruning -
This step discovers the candidate information sources whose content descriptions are related to the scope of a query Q (e.g., in terms of substring matching). Level-one relevance pruning uses the user query scope description of Q and the content and category descriptions of the sources. Redundant or replicated sources are detected and removed. Other factors, such as the unavailability or affordability of a source, should also be considered at this step. We call the set of sources selected by this step the target information sources of level-one relevance.
- Step 2: Level-two relevance pruning -
This step prunes the information sources that have level-one relevance but do not offer enough query capability to contribute to the answer of Q. This may be due to a restriction on the scope of the query parameters of Q, a restriction on the list of input or output arguments of a source, or a conflict between the query's interest and the access constraints associated with a source. The user query capability description of Q and the source capability profiles are used in level-two relevance pruning. We call the set of sources selected by this step the target information sources of level-two relevance.
- Step 3: Level-three relevance pruning -
This step explores the relevance of the data sources with respect to the content specified in the query condition, in order to further prune information sources irrelevant to answering Q. For instance, a search on database conferences should not return conferences in medicine. We call the set of sources selected by this step the target information sources of level-three relevance.
- Step 4: Level-four relevance pruning -
This step explores run-time information to further prune information sources irrelevant to answering Q. It is especially helpful when the subset of information sources surviving the earlier pruning levels is still not significantly small compared with the large collection of available information sources. We call the set of sources selected by this step the target information sources of level-four relevance. Level-four relevance pruning is built as a plug-and-play component and can be switched on, or tuned, at any time the need arises.
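The sketch below strings the four levels together as successive filters over a set of source profiles. The Query dataclass, the profile fields, and the predicates are placeholders invented for the example; as noted above, the actual profile structures and pruning algorithms differ per source selection mechanism.

```python
from dataclasses import dataclass, field

@dataclass
class Query:
    scope: str                                  # textual scope description
    required_params: set = field(default_factory=set)
    topic: str = ""

def multilevel_prune(sources, query, runtime_stats=None):
    """Progressively prune a list of source-profile dicts through the
    four relevance levels; each level filters the previous survivors."""
    # Level one: content/category scope match (here, substring matching).
    survivors = [s for s in sources if query.scope in s["content"]]
    # Level two: capability check against the query's required parameters.
    survivors = [s for s in survivors
                 if query.required_params <= s["capabilities"]]
    # Level three: content relevance of the query condition itself.
    survivors = [s for s in survivors if query.topic in s["topics"]]
    # Level four (plug-and-play): run-time information, e.g. dropping
    # servers that are currently unreachable.
    if runtime_stats is not None:
        survivors = [s for s in survivors
                     if runtime_stats.get(s["name"], {}).get("up", True)]
    return survivors
```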