Internet Data Mining


Sponsor: Ling Liu / David Buttler
{lingliu, buttler}@cc.gatech.edu
223 / 260 CCB
Area: Systems and Databases
Related Projects:


Problem
The explosive growth of the Internet has become an overused cliche, yet the problems of information overload remain as real as ever. Web search engines provide one way to manage the deluge of information on the Internet, but they have serious drawbacks for many applications. Common search engines do not index dynamic content; any URL containing a '?' is ignored. Nor do they return results at a finer granularity than a single HTML page. Their design makes them unsuitable for comparison shopping or data integration.

The DISL group has constructed a powerful set of information extraction tools that address some of these problems. Several research challenges remain, however. The following figure presents a simple architecture for a dynamic search engine.

Within this framework there are several possible short projects suitable for a 7001 mini-project or an extended Special Problems project.

  1. Design and implement a robot crawler that discovers new dynamic search engine interfaces.
  2. Design a technique to categorize a search engine by its contents (the pages that it dynamically generates), the types of queries it responds to (query interface), or the context of the search interface.
  3. In conjunction with the categorization system, develop a user interface that helps users select the types of sources that are applicable to their query (see the AQR project for an example static system).
  4. Improve the automated object extraction system. This item may itself be broken down into individual projects.

    Currently, the automated object extraction system works in two phases: (1) identify the region of a dynamically generated web page that contains data objects; (2) discover how the objects are separated (e.g. is there a single tag that separates objects?), and use the separator to split the data region into objects.
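
The two phases described above can be sketched roughly as follows. This is only an illustration and not the DISL group's extraction system. It assumes the page is well-formed XHTML that the JDK's DOM parser can read, picks the data region as the element with the most direct child elements, and picks the separator as the most frequent child tag in that region. Both heuristics, the class name ObjectExtractionSketch, and all method names are hypothetical stand-ins chosen for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ObjectExtractionSketch {

    /** Returns the direct child elements of a node, skipping text and comments. */
    static List<Element> childElements(Node parent) {
        List<Element> out = new ArrayList<>();
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            if (kids.item(i) instanceof Element) out.add((Element) kids.item(i));
        }
        return out;
    }

    /** Phase 1 (assumed heuristic): the element with the most direct children is the data region. */
    static Element findDataRegion(Element root) {
        Element best = root;
        for (Element child : childElements(root)) {
            Element candidate = findDataRegion(child);
            if (childElements(candidate).size() > childElements(best).size()) best = candidate;
        }
        return best;
    }

    /** Phase 2a (assumed heuristic): the most frequent child tag is the separator; ties go to the earliest tag. */
    static String findSeparator(Element region) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Element e : childElements(region)) counts.merge(e.getTagName(), 1, Integer::sum);
        String best = null;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (best == null || entry.getValue() > counts.get(best)) best = entry.getKey();
        }
        return best;
    }

    /** Phase 2b: start a new object at each occurrence of the separator tag. */
    static List<List<Element>> splitObjects(Element region, String separator) {
        List<List<Element>> objects = new ArrayList<>();
        for (Element e : childElements(region)) {
            if (e.getTagName().equals(separator) || objects.isEmpty()) objects.add(new ArrayList<>());
            objects.get(objects.size() - 1).add(e);
        }
        return objects;
    }

    /** Parses a well-formed XHTML string and returns its root element. */
    static Element parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        // A made-up result page: three objects, each introduced by an <hr/> tag.
        String page = "<html><body><h1>Results</h1><div>"
                + "<hr/><b>Book A</b><i>$10</i>"
                + "<hr/><b>Book B</b><i>$12</i>"
                + "<hr/><b>Book C</b><i>$9</i>"
                + "</div><p>footer</p></body></html>";
        Element region = findDataRegion(parse(page));
        String separator = findSeparator(region);
        for (List<Element> object : splitObjects(region, separator)) {
            StringBuilder sb = new StringBuilder();
            for (Element e : object) sb.append(e.getTextContent()).append(' ');
            System.out.println(sb.toString().trim());
        }
    }
}
```

On the made-up page in main, the sketch selects the div as the data region and hr as the separator, then prints one line per object. Real result pages are rarely well-formed and often defeat such simple heuristics, which is part of what makes improving the extraction system a research problem.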

    Mini-projects in this area may include the following:

There are several interesting projects related to this topic. Please see either David or Prof. Ling Liu to discuss other options.

Resources that may be helpful:


Background

You are expected to have a solid grasp of Java programming. Familiarity with XML is useful but not required.


Deliverables

A report describing the work you did and how you evaluated your results, together with any source code you produced to obtain them.

Evaluation
You will be graded on the novelty and quality of your report and implementation.