|
|
|
Organization Name:
College of Computing, Georgia Tech
Principal Investigator:
Calton Pu
Co-PI and Technical Manager:
Ling Liu
The first challenge is the heterogeneity of data sources and consumer requirements. There is a need for new technology to intelligently process, interpret, fuse, combine, and abstract useful knowledge from these heterogeneous data sources, which use different interfaces, query languages, data structures, terminology, and semantics to represent the same or similar information.
The second challenge is the requirement on data quality, reliability, and freshness that must be maintained throughout the information flow from producers to consumers. Consider the logistics scenario of a joint operation in East Africa. As supplies are loaded and transported from their current locations around the world to the theater of operations, the logistics anchor desk must keep track of their progress and current location. This is a time-critical task, since the logistics anchor desk must have up-to-date information for planning as well as command and control. In this situation, a periodic update policy is of limited value; even a dynamically adjustable period only alleviates the problem. Thus, there is a need for new technology to effectively and efficiently monitor information updates.
The third challenge is query responsiveness and fault tolerance in the presence of slow delivery and availability problems caused by network congestion and source contention. For example, in logistics applications it is very important to propagate partial results when sources are unavailable or the network is degraded. This problem is aggravated when we take into account the quality-of-information requirements of the second challenge. A complementary problem is the handling of conflicts and update merging when new sources join the network, or when existing sources recover from a crash. Thus, there is a need for dynamic query evaluation techniques that can incorporate run-time information to revise the query plan when some sources or intermediate sites are unavailable or deliver data slowly.
A continual query can be posed over multiple heterogeneous data sources. It is quite common that, at any given time point, some sources have changed while others remain the same with respect to the previous time point. It is in general expensive to re-evaluate the whole query over all the data sources for each execution of a continual query, although in some circumstances reprocessing the query from scratch may be unavoidable. Therefore, it is important to find optimization methods that bypass complete re-evaluation, avoiding duplicate computation and unnecessary data transmission over the network.
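To make the optimization concrete, the following is a minimal sketch of how re-evaluation can be bypassed for unchanged sources: each source carries a version stamp, and a per-source partial result is reused whenever the stamp has not moved since the previous execution. The function name, the cache layout, and the version-stamp mechanism are all illustrative assumptions, not the CQ project's actual interface.

```python
def evaluate_cq(sources, query, cache):
    """Evaluate a continual query, skipping unchanged sources.

    sources: {name: (version, rows)} -- current state of each data source
    cache:   {name: (version, partial_result)} -- state from the last execution
    """
    result = []
    for name, (version, rows) in sources.items():
        cached = cache.get(name)
        if cached and cached[0] == version:
            partial = cached[1]  # source unchanged: reuse the prior partial result
        else:
            partial = [r for r in rows if query(r)]  # re-evaluate only this source
            cache[name] = (version, partial)
        result.extend(partial)
    return result
```

On a second execution where only one source's version has changed, only that source is rescanned; the rest of the answer comes from the cache.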
General Updates:
An important distinction between CQs and previous work is the
handling of general updates, including insertions, deletions, and
modifications. For example, let us assume that a job opening database
is updated whenever a posted position is filled (deletion)
or a new job posting arrives (insertion).
Some users may desire to be informed of both the incoming new
openings from some particular type of companies and the positions that
were available before (e.g., in the previous execution of the query)
but have been recently filled or canceled due to budget disapproval.
Simply recomputing the query for each
execution does not guarantee correct results, because a conventional
SQL query manager can only filter the data that currently match the
query.
To provide adequate support for flexible incorporation of
updates, the CQ manager must exploit the knowledge implied by both the
query expressions and database update operations.
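The job-posting example above can be sketched as a delta between two consecutive result sets: rows that newly match the query are insertions, and rows that matched before but no longer appear correspond to filled or canceled positions, which a plain recomputation would silently drop. The function and key names are illustrative assumptions.

```python
def cq_delta(previous, current):
    """Update-aware result of a continual query execution.

    previous/current: sets of rows matching the query at two
    consecutive executions.
    """
    return {
        "new_openings": current - previous,       # inserted since the last execution
        "filled_or_canceled": previous - current, # deleted since the last execution
    }
```

A user interested in both new openings and recently filled positions would be shown both components, rather than only the currently matching rows.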
CQ Evaluation: Another important problem
in monitoring information updates is the cost of
query evaluation. A naive query evaluation strategy would recompute
the same query from scratch for each execution request and
highlight the
differences. In the Internet environment, such a strategy can easily
overwhelm the users and overload the system and networks,
especially when a large number of records match the query
but only a small fraction changes each time, or when
some users want to see only the data that have
been updated since the last query result.
Incremental Algorithms:
We have proposed the
differential re-evaluation algorithm (DRA) for processing
CQs. Our solution avoids the overhead of
recomputing the query from scratch, and relaxes the restrictions of
append-only databases and monotonic-transformable queries. Our design
builds on research in differential files,
hypothetical relations, and incrementally updated materialized
views, and offers several advantages over previous
solutions. The basic idea is to compute the result of the
current execution of a continual query by combining the result
produced by the most recent execution with the updates that occurred
between these two consecutive executions. We use
differential relations to record changes to the source relations (or
materialized views), and define a set of differential operators to
compute the updates that occurred between two consecutive
executions of a continual query.
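For the simplest case, a selection query over a single relation, the differential idea can be sketched as follows: the new result is the previous result minus the deleted rows, plus the inserted rows that satisfy the predicate (a modification is modeled as a delete of the old row plus an insert of the new one). This is a minimal illustration of the approach, not the DRA's actual operator set.

```python
def dra_select(prev_result, inserted, deleted, pred):
    """Differential re-evaluation of a selection query.

    prev_result: rows produced by the most recent execution
    inserted/deleted: differential relations accumulated since then
    pred: the selection predicate of the continual query
    """
    survivors = [r for r in prev_result if r not in deleted]
    additions = [r for r in inserted if pred(r)]  # only new rows are tested
    return survivors + additions
```

Note that only the (typically small) differential relations are scanned; the base relation is never touched during re-evaluation.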
Performance Tuning Through Specializations:
An important problem in the processing of continual queries is that
different update propagation and query evaluation algorithms perform
best in different environments. We will evaluate these algorithms
according to their capabilities and use specialization techniques
developed in the Synthetix project to improve overall system
performance. Intuitively, specialization says that we should use
the best algorithm for each specific situation we can characterize.
For example, in this project we will implement different
algorithms for the different scenarios such as those given below.
The two main variables are the number of queries on the data and the update rate. Intuitively, an algorithm that takes advantage of low update rates (the first case below) will probably not work well when that assumption does not hold. Instead of trying to devise one general algorithm for all cases, we will apply our experience with specialization to accommodate these kinds of algorithms.
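A specialization dispatcher over these two variables might look like the sketch below. The strategy names and thresholds are entirely hypothetical, chosen only to illustrate picking a specialized algorithm from run-time characteristics.

```python
def choose_strategy(update_rate, num_queries):
    """Pick an update-propagation strategy from run-time characteristics.

    update_rate: fraction of rows updated per execution interval (assumed metric)
    num_queries: number of continual queries installed over the data
    """
    if update_rate < 0.01:       # rare updates: incremental re-evaluation pays off
        return "incremental"
    if num_queries > 1000:       # many queries over hot data: batch and share work
        return "batched_shared"
    return "full_recompute"      # high churn, few queries: recomputing is cheapest
```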
Combining Client Pull with Server Push:
Continual queries can be seen as a hybrid mode of data delivery that
combines the client-pull and server-push mechanisms: the transfer of
information from servers to clients is first initiated
by a client pull, and subsequent transfers of updated information
to clients are initiated by server pushes.
In the CQ project, we will explore a number of ways to
provide more flexible combinations of
client-pull and server-push, including
monitoring changes in information sources
by incremental organization and
management of client caches based on query specification;
providing a flexible and user-friendly CQ adapter interface
for installation of distributed continual query services;
developing strategies for dynamic interaction of ad-hoc query
answering and smart (semantic) query caching; and
incorporating services such as prefetching, update dissemination,
and balancing push-based and pull-based information access into
the CQ system architecture.
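The pull-then-push delivery mode described above can be sketched as follows: registering a continual query returns the initial answer (the client pull), and subsequent matching updates are delivered to the registered callback (the server push). The class and method names are illustrative assumptions, not the CQ system's actual interface.

```python
class CQServer:
    """Toy server combining client-pull registration with server-push updates."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.subs = []  # (predicate, callback) pairs for installed continual queries

    def register(self, pred, callback):
        """Install a continual query; the return value is the initial client pull."""
        self.subs.append((pred, callback))
        return [r for r in self.rows if pred(r)]

    def insert(self, row):
        """Apply an update and push it to every matching subscriber."""
        self.rows.append(row)
        for pred, callback in self.subs:
            if pred(row):
                callback(row)
```

Non-matching updates are filtered at the server, so clients see only the deltas relevant to their installed queries.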
cq@cc.gatech.edu