Internet Data Mining

Sponsor Ling Liu / David Buttler
{lingliu, buttler}@cc.gatech.edu
223 / 260 CCB

Area Systems and Databases
Related Projects

Problem
The explosive growth of the Internet has become an overused cliche, yet the problem of information overload remains as real as ever. Web search engines provide one way to manage the deluge of information on the Internet, but they have serious drawbacks for many applications. Common search engines do not index dynamic content; any URL containing a '?' is ignored. Nor do they return results at a finer granularity than a single HTML page. This design makes them unsuitable for comparison shopping or data integration.

The DISL group has constructed a powerful set of information extraction tools to address some of these problems. Several research challenges remain, however. The following figure presents a simple architecture for a dynamic search engine.

Within this framework there are several possible short projects suitable for a 7001 mini-project or an extended Special Problems project.

- Design and implement a robot crawler that discovers new dynamic search engine interfaces
- Design a technique to categorize a search engine by its contents (the pages that it dynamically generates), the types of queries it responds to (its query interface), or the context of the search interface
- In conjunction with the categorization system, develop a user interface that helps users select the types of sources that are applicable to their query (see the AQR project for an example of a static system)
- Improve the automated object extraction system. This item can itself be broken down into several individual projects, as described below

Currently, the automated object extraction system works in two phases: (1) identify the region of a dynamically generated web page that contains data objects; (2) discover how the objects are separated (e.g. is there a single tag that separates objects?), and use the separator to split the data region into objects.
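The two phases can be sketched in Java as follows. This is only an illustrative sketch, not the DISL implementation: the Node class is a hypothetical stand-in for a parsed HTML tag tree, and both heuristics (largest fan-out for the data region, most frequent child tag for the separator) are assumptions chosen for simplicity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ObjectExtractionSketch {

    /** Hypothetical stand-in for a parsed HTML tag tree (not the DISL library's type). */
    static final class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();

        Node(String tag, Node... kids) {
            this.tag = tag;
            for (Node k : kids) {
                children.add(k);
            }
        }
    }

    /** Phase 1 (assumed heuristic): the data region is the node with the most direct children. */
    static Node findDataRegion(Node root) {
        Node best = root;
        for (Node child : root.children) {
            Node candidate = findDataRegion(child);
            if (candidate.children.size() > best.children.size()) {
                best = candidate;
            }
        }
        return best;
    }

    /** Phase 2a (assumed heuristic): the separator is the most frequent child tag in the region. */
    static String findSeparator(Node region) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Node c : region.children) {
            counts.merge(c.tag, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) { // ties go to the tag seen first
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Phase 2b: start a new object at every occurrence of the separator tag. */
    static List<List<Node>> splitObjects(Node region, String separator) {
        List<List<Node>> objects = new ArrayList<>();
        List<Node> current = new ArrayList<>();
        for (Node c : region.children) {
            if (c.tag.equals(separator) && !current.isEmpty()) {
                objects.add(current);
                current = new ArrayList<>();
            }
            current.add(c);
        }
        if (!current.isEmpty()) {
            objects.add(current);
        }
        return objects;
    }

    public static void main(String[] args) {
        Node page = new Node("body",
                new Node("h1"),
                new Node("div",
                        new Node("h3"), new Node("p"),
                        new Node("h3"), new Node("p"),
                        new Node("h3"), new Node("p")),
                new Node("p"));
        Node region = findDataRegion(page);
        String separator = findSeparator(region);
        System.out.println("region=" + region.tag + " separator=" + separator
                + " objects=" + splitObjects(region, separator).size());
    }
}
```

Keeping the two phases as separate methods means a new data-region heuristic or a new separator heuristic can be swapped in and evaluated independently of the other.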

Mini-projects in this area may include the following:
- Develop a new heuristic to identify where the data objects are; validate the effectiveness of the heuristic
- Develop a new heuristic to split the data region into data objects; validate the effectiveness of the heuristic
- Implement a more sophisticated technique for combining individual heuristics to produce a better result, for either the data-region identification heuristics or the object separator discovery heuristics
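As one possible starting point for combining heuristics, the sketch below uses weighted reciprocal-rank voting. It is an assumption for illustration, not the method the existing system uses: each heuristic supplies a ranked list of candidate separator tags and a weight (for example, its measured success rate), and the Vote class and combine method are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HeuristicCombinerSketch {

    /** One heuristic's ranked guesses (best first) and a weight, e.g. its measured success rate. */
    static final class Vote {
        final List<String> rankedCandidates;
        final double weight;

        Vote(List<String> rankedCandidates, double weight) {
            this.rankedCandidates = rankedCandidates;
            this.weight = weight;
        }
    }

    /**
     * Weighted reciprocal-rank voting: a candidate at 0-based rank r earns weight / (r + 1).
     * Returns the candidate with the highest total score, or null if there are no candidates.
     */
    static String combine(List<Vote> votes) {
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Vote v : votes) {
            for (int r = 0; r < v.rankedCandidates.size(); r++) {
                scores.merge(v.rankedCandidates.get(r), v.weight / (r + 1), Double::sum);
            }
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : scores.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```

With this scheme a candidate that several weaker heuristics rank highly can outscore the top choice of a single stronger heuristic, which is the kind of behavior a combination technique would need to validate against the test pages.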

There are several other interesting projects related to this topic. Please see either David or Prof. Ling Liu to discuss other options.

Resources that may be helpful:

Local Java code library (convert an HTML file into a tree, automatically extract textual objects from a page, and more).
A Java framework to automatically run a heuristic over a large set of test web pages
A set of web pages for testing solutions, plus a method to evaluate whether a data-region heuristic or an object separator heuristic succeeded on a given web page.
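
To illustrate what validating a heuristic over a set of test pages might look like, here is a minimal sketch that computes a success rate. It does not use the local framework, whose API is not described here; the successRate method, the representation of a page as a String, and the map of hand-labeled expected answers are all assumptions.

```java
import java.util.Map;
import java.util.function.Function;

class HeuristicEvaluationSketch {

    /**
     * Fraction of labeled test pages on which the heuristic's answer equals the expected answer.
     * Keys are page contents; values are the hand-labeled correct answers (e.g. a separator tag).
     */
    static double successRate(Function<String, String> heuristic, Map<String, String> labeledPages) {
        if (labeledPages.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        for (Map.Entry<String, String> e : labeledPages.entrySet()) {
            if (e.getValue().equals(heuristic.apply(e.getKey()))) {
                hits++;
            }
        }
        return (double) hits / labeledPages.size();
    }
}
```

A success rate measured this way could also serve as the weight given to each heuristic when several heuristics are combined.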

Background

You are expected to have a solid grasp of Java programming. Familiarity with XML is useful but not required.

Deliverables

A report describing the work you did and how you evaluated your results, plus any source code you produced to obtain those results.

Evaluation
You will be graded on the novelty and quality of your report and implementation.
