Internet Data Mining
Sponsor: Ling Liu / David Buttler, {lingliu, buttler}@cc.gatech.edu, 223 / 260 CCB
Area: Systems and Databases
Related Projects
Problem
The explosive growth of the Internet has become an overused cliche, yet the problems of information overload remain
as real as ever. Web search engines provide one way to manage the deluge of information on the Internet, but they
have some serious drawbacks for many applications. Common search engines do not index dynamic content; any URL
with a '?' is ignored. Neither do search engines provide finer granularity than a single HTML page. Their design
makes them unsuitable for comparison shopping or data integration.
The DISL group has built a powerful set of information extraction tools that address some of these
problems. Several research challenges remain, however. The following figure presents a simple
architecture for a dynamic search engine.
Within this framework there are several possible short projects suitable for a 7001 mini-project or an extended
Special Problems project:
- Design and implement a robot crawler that discovers new dynamic search engine interfaces
- Design a technique to categorize a search engine by its contents (the pages that it dynamically generates),
the types of queries it responds to (query interface), or the context of the search interface.
- In conjunction with the categorization system, develop a user interface that assists users in selecting
the appropriate types of sources that are applicable to their query (see the
AQR project for an example static system)
- Improve the automated object extraction system. This task can itself be broken down into individual projects.
Currently, the automated object extraction system works in two phases: (1) identify the region of a dynamically
generated web page that contains data objects; (2) discover how the objects are separated (e.g. is there a
single tag that separates objects?), and use the separator to split the data region into objects.
Mini-projects in this area may include the following:
- Develop a new heuristic to identify where the data objects are; validate the effectiveness of the heuristic
- Develop a new heuristic to split the data region into data objects; validate the effectiveness of the heuristic
- Implement a more sophisticated technique to combine individual heuristics to produce a better result,
either for the data region identification heuristics or for the object separator discovery heuristics.
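The two-phase process described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the DISL extraction system: the class name, the sample page, and both heuristics (region = element with the most child elements, separator = most frequent child tag) are hypothetical stand-ins, and the page is assumed to be well-formed XHTML so that the XML parser bundled with the JDK can read it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class ObjectExtractionSketch {

    // Hypothetical sample page (assumed well-formed XHTML).
    static final String SAMPLE_PAGE =
        "<html><body><h1>Results</h1><table><caption>3 hits</caption>"
        + "<tr><td>Book A</td> <td>$10</td></tr>"
        + "<tr><td>Book B</td> <td>$12</td></tr>"
        + "<tr><td>Book C</td> <td>$9</td></tr>"
        + "</table><p>Footer</p></body></html>";

    // Returns only the element children of a node, skipping text and comments.
    static List<Element> childElements(Element parent) {
        List<Element> out = new ArrayList<>();
        NodeList kids = parent.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            if (kids.item(i) instanceof Element) out.add((Element) kids.item(i));
        }
        return out;
    }

    // Phase 1 (assumed heuristic): the data region is the element with the most child elements.
    static Element findDataRegion(Element root) {
        Element best = root;
        int bestCount = childElements(root).size();
        Deque<Element> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Element e = stack.pop();
            List<Element> kids = childElements(e);
            if (kids.size() > bestCount) { best = e; bestCount = kids.size(); }
            for (Element k : kids) stack.push(k);
        }
        return best;
    }

    // Phase 2 (assumed heuristic): the separator is the most frequent child tag of the region.
    static String findSeparator(Element region) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Element c : childElements(region)) counts.merge(c.getTagName(), 1, Integer::sum);
        String best = null;
        int max = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > max) { best = e.getKey(); max = e.getValue(); }
        }
        return best;
    }

    // Starts a new object at every separator tag; content before the first separator is dropped.
    static List<String> splitObjects(Element region, String separator) {
        List<String> objects = new ArrayList<>();
        StringBuilder current = null;
        for (Element c : childElements(region)) {
            if (c.getTagName().equals(separator)) {
                if (current != null) objects.add(current.toString().trim());
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(c.getTextContent().replaceAll("\\s+", " ").trim()).append(' ');
            }
        }
        if (current != null) objects.add(current.toString().trim());
        return objects;
    }

    // Runs both phases on a page given as a string.
    static List<String> extract(String xhtml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            Element region = findDataRegion(root);
            return splitObjects(region, findSeparator(region));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse page", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE_PAGE)); // prints [Book A $10, Book B $12, Book C $9]
    }
}
```

Both heuristics are deliberately naive. For example, the region heuristic would pick a long navigation menu over a short result list, which is the kind of weakness a mini-project heuristic would need to address and validate.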
There are several other interesting projects related to this topic. Please see either David or Prof. Ling Liu to
discuss other options.
Resources that may be helpful:
- Local Java code library (convert an HTML file into a tree, automatically extract textual objects from a page, and more).
- A Java framework to automatically run a heuristic over a large set of test web pages.
- A set of web pages to test solutions, plus a method to evaluate whether a data-region heuristic or
an object separator heuristic succeeded on a given web page.
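To give a feel for what running a heuristic over a labeled set of test pages might look like, here is a minimal sketch. It does not use the local framework or its API; every name in it (`SeparatorHeuristic`, `LabeledPage`, the three toy heuristics, and the four sample pages) is a hypothetical placeholder. Each page is reduced to the list of child tags in its data region plus the known-correct separator, and a simple plurality vote stands in for a "more sophisticated" way of combining heuristics. It assumes Java 9 or later for `List.of`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HeuristicEvaluationSketch {

    // Hypothetical heuristic contract: guess the separator tag, or return null for no guess.
    interface SeparatorHeuristic { String guess(List<String> regionTags); }

    // Hypothetical labeled test page: region child tags plus the correct separator.
    static final class LabeledPage {
        final List<String> regionTags;
        final String expected;
        LabeledPage(String expected, String... tags) {
            this.expected = expected;
            this.regionTags = List.of(tags);
        }
    }

    // Returns the key with the highest count; ties go to the key inserted first.
    static String argMax(Map<String, Integer> counts) {
        String best = null;
        int max = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > max) { best = e.getKey(); max = e.getValue(); }
        }
        return best;
    }

    // Toy heuristic 1: the most frequent tag.
    static String mostFrequentTag(List<String> tags) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : tags) counts.merge(t, 1, Integer::sum);
        return argMax(counts);
    }

    // Toy heuristic 2: the first tag in the region.
    static String firstTag(List<String> tags) {
        return tags.isEmpty() ? null : tags.get(0);
    }

    // Toy heuristic 3: the first tag that repeats at a constant distance.
    static String regularlySpacedTag(List<String> tags) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        for (int i = 0; i < tags.size(); i++) {
            positions.computeIfAbsent(tags.get(i), k -> new ArrayList<>()).add(i);
        }
        for (Map.Entry<String, List<Integer>> e : positions.entrySet()) {
            List<Integer> pos = e.getValue();
            if (pos.size() < 2) continue;
            int gap = pos.get(1) - pos.get(0);
            boolean regular = true;
            for (int i = 2; i < pos.size(); i++) {
                if (pos.get(i) - pos.get(i - 1) != gap) regular = false;
            }
            if (regular) return e.getKey();
        }
        return null;
    }

    // Combines heuristics by plurality vote; null guesses are ignored.
    static SeparatorHeuristic vote(List<SeparatorHeuristic> heuristics) {
        return tags -> {
            Map<String, Integer> votes = new LinkedHashMap<>();
            for (SeparatorHeuristic h : heuristics) {
                String g = h.guess(tags);
                if (g != null) votes.merge(g, 1, Integer::sum);
            }
            return argMax(votes);
        };
    }

    // Fraction of pages on which the heuristic returns the labeled separator.
    static double successRate(SeparatorHeuristic h, List<LabeledPage> pages) {
        if (pages.isEmpty()) return 0.0;
        int hits = 0;
        for (LabeledPage p : pages) {
            if (p.expected.equals(h.guess(p.regionTags))) hits++;
        }
        return (double) hits / pages.size();
    }

    // Made-up test set, chosen so that each toy heuristic fails on exactly one page.
    static final List<LabeledPage> PAGES = List.of(
        new LabeledPage("hr", "hr", "b", "i", "hr", "b", "i", "hr", "b", "i"),
        new LabeledPage("tr", "caption", "tr", "tr", "tr"),
        new LabeledPage("p", "p", "br", "br", "p", "br", "br", "p", "br", "br"),
        new LabeledPage("tr", "tr", "tr", "th", "tr", "tr"));

    public static void main(String[] args) {
        List<SeparatorHeuristic> all = List.of(
            HeuristicEvaluationSketch::mostFrequentTag,
            HeuristicEvaluationSketch::firstTag,
            HeuristicEvaluationSketch::regularlySpacedTag);
        for (SeparatorHeuristic h : all) System.out.println(successRate(h, PAGES)); // 0.75 each
        System.out.println(successRate(vote(all), PAGES)); // 1.0
    }
}
```

On this contrived test set each individual heuristic succeeds on three of four pages while the vote succeeds on all four. Real pages will not be this tidy, so the numbers only illustrate how a combined heuristic can be compared against its components.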
Background
You are expected to have a solid grasp of Java programming. Familiarity with XML is useful but not required.
Deliverables
A report describing the work you did and how you evaluated your results,
plus any source code you produced to obtain those results.
Evaluation
You will be graded on the novelty and quality of your report and implementation.