The complexity and scale of the World Wide Web (WWW) have fueled the development of search tools such as Yahoo, the Harvest Web Broker [6], and Lycos [22]. These tools are sufficient for simple keyword-based searching, but their results carry only simple information and are returned as HTML documents, which impedes further processing. There is a need for tools that help consumers merge, combine, and sort results retrieved from multiple sources. Currently, users who wish to gather information from several tools must visit each tool separately and then assemble the results manually. By creating a wrapper for each information tool, Diorama allows users to merge, combine, and sort their search results through a single query. To our knowledge, no similar tool existed at the time of this writing.
Since most of these information sources allow only network (HTTP) access, our wrapper is not physically located on the server machine of the tool. The link from our wrapper to the source is implemented through third-party software (the get_url.pl Perl library), which retrieves output from remote HTML pages and remote CGI scripts.
The following generic wrapper function diom_wrapper_Select will be used as the running example to illustrate reusability in DIOM wrapper construction and generation.
sub diom_wrapper_Select {
    local($result);
    ## fetch data
    $result=&local_ExternalFetch;
    $verbose &&
        print "RAW PAGE ======\n<pre>\n $result \n</pre>".
              " \n===== End RAW PAGE\n";
    ## normalization
    $tablename=&local_ExternalNormalize ($result);
    ## process normalized data
    $return=&local_Select ($tablename);
    ## clean up tmp tables
    &local_Close ($tablename);
    ## return answer
    $return;
}
The code above is written for wrappers of semi-structured tools that are accessed over the network. The function diom_wrapper_Select first fetches the data from the network information tool by calling local_ExternalFetch, then normalizes it inside the wrapper DBMS by calling local_ExternalNormalize. It next calls local_Select to perform the necessary selections on the data based on the condition, cleans up the temporary tables with local_Close, and returns the result as a packaged DIOM object (MIME type).
The local wrapper function local_ExternalFetch below composes the correct URL for the information tool, grabs the web page, and returns the raw result to diom_wrapper_Select.
sub local_ExternalFetch {
    ## template for web pages
    local($url);
    ## compose correct URL
    $url=&HTML_makeURL;
    ($verbose) && (print "URL: $url\n");
    ## grab web page
    eval "\$result=&GrabWebPage(\$url)" ||
        die "Page Grab failed\n";
    ($verbose) && (print "Page Grab SUCCESS\n");
    $result;
}
The local_ExternalNormalize function below takes the raw result and normalizes it into the DIOM internal object representation.
sub local_ExternalNormalize {
    local($result)=@_;
    ## create temporary table name to use
    $tablename=&local_MakeTableName;
    ## create normalization commands
    @insertCommands =
        &HTML_translate($tablename,$result);
    ## print commands
    $verbose && print "INSERT COMMANDS ".
        "=======\n@insertCommands\n===== END INSERT ".
        "COMMANDS\n";
    ## execute commands to insert into table
    eval "&insertScratch (\$tablename, ".
        "\@insertCommands)" || print "ERROR with ".
        "creating scratch table and data";
    ## return the table name
    $tablename;
}
Let us take the wrapper to Yahoo as a concrete example. The Yahoo URL starts with the web host and the path to the CGI script:
source: http://search.yahoo.com/bin/search?
To complete the Yahoo URL, we append the search parameters to the query string using the following notation:
p=<keywords> (separated by +)
d=(ignore this) (yahoo, usenet, email)
s= o|a (or / and)
w= s (partial) | w (exact) (word matching partial / exact)
n=<number> (number of results.. 100 max for Yahoo)
Here are two valid Yahoo URLs that search for the keywords job and multimedia:
http://search.yahoo.com/bin/search?p=job+multimedia
http://search.yahoo.com/bin/search?p=job+multimedia&d=y&s=o&w=s&n=25
The HTML_makeURL function called in the wrapper function local_ExternalFetch is a source-dependent wrapper function that creates the proper URL for Yahoo. It encodes the Yahoo URL rules described above in a Perl script that transforms the condition field given in the call to the wrapper. The HTML_makeURL code for Yahoo is provided in Appendix A.
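To make the URL rules above concrete, the following is a minimal sketch of how such a URL could be composed. It is not the HTML_makeURL code of Appendix A: the function name sketch_make_yahoo_url and the hash layout of the condition are assumptions made for this illustration only, while the base URL and the p, d, s, w, and n parameters are taken from the notation above.

```perl
use strict;
use warnings;

# Hypothetical sketch: build a Yahoo search URL from a parsed condition.
# The base URL and the p/d/s/w/n parameters come from the text;
# the layout of %cond is an assumption made for this example.
sub sketch_make_yahoo_url {
    my (%cond) = @_;
    my $base = 'http://search.yahoo.com/bin/search?';

    # Keywords are mandatory and are joined with '+'.
    my @keywords = @{ $cond{keywords} || [] };
    die "no keywords given\n" unless @keywords;

    # Percent-encode anything that is not safe inside a query string.
    my @escaped = map {
        (my $k = $_) =~ s/([^A-Za-z0-9_.~-])/sprintf('%%%02X', ord($1))/ge;
        $k;
    } @keywords;

    # The text states a maximum of 100 results for Yahoo.
    $cond{n} = 100 if defined $cond{n} && $cond{n} > 100;

    my @params = ('p=' . join('+', @escaped));

    # Optional parameters are appended only when the caller supplies them.
    for my $name (qw(d s w n)) {
        push @params, "$name=$cond{$name}" if defined $cond{$name};
    }
    return $base . join('&', @params);
}

print sketch_make_yahoo_url(keywords => [qw(job multimedia)]), "\n";
print sketch_make_yahoo_url(
    keywords => [qw(job multimedia)],
    d => 'y', s => 'o', w => 's', n => 25,
), "\n";
```

Called as shown, the sketch reproduces the two example URLs listed above. Keeping the optional parameters in a fixed order makes the generated URLs easy to compare against hand-written ones.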
The HTML_translate function called in the wrapper function local_ExternalNormalize is also a source-dependent function. It extracts the URL and description from the raw Yahoo HTML result, where a single record has the following format:
<li><a href="http://mlds-www.arc.nasa.gov/BAMTA/">BAMTA <b>Job</b>
Bank</a> - <b>multimedia</b> and Web technology related positions.
If every record in the raw HTML result has the same format as the one given above, then parsing each record against that format allows the data to be extracted. HTML_translate is therefore based on the format of the records that Yahoo returns. Appendix A provides the detailed code of HTML_translate for Yahoo.
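The record-by-record parsing described above can be sketched as follows. This is an illustration under stated assumptions, not the HTML_translate code of Appendix A: the function name sketch_extract_records and the returned list of hashes are hypothetical, and the sketch stops at extraction rather than generating the insert commands that the wrapper passes to the DBMS.

```perl
use strict;
use warnings;

# Hypothetical sketch: extract URL, title, and description from Yahoo-style
# result records of the form
#   <li><a href="URL">TITLE</a> - DESCRIPTION
# The function name and the returned structure are assumptions.
sub sketch_extract_records {
    my ($html) = @_;
    my @records;

    # Split on <li> so each chunk holds one record; records may span lines.
    for my $chunk (split /<li>/i, $html) {
        next unless $chunk =~ m{<a\s+href="([^"]+)"[^>]*>(.*?)</a>\s*(?:-\s*)?(.*)}is;
        my ($url, $title, $rest) = ($1, $2, $3);

        # Drop markup such as <b>...</b> and collapse whitespace.
        for ($title, $rest) {
            s/<[^>]+>//g;
            s/\s+/ /g;
            s/^ | $//g;
        }
        push @records, { url => $url, title => $title, description => $rest };
    }
    return @records;
}

my $sample = <<'HTML';
<li><a href="http://mlds-www.arc.nasa.gov/BAMTA/">BAMTA <b>Job</b>
Bank</a> - <b>multimedia</b> and Web technology related positions.
HTML

for my $r (sketch_extract_records($sample)) {
    print "$r->{url} | $r->{title} | $r->{description}\n";
}
```

A parser of this kind depends entirely on the record layout, so it has to be revised whenever the source changes its HTML output.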
Using the diom_wrapper_Select function as an example, we have illustrated the roles of the generic wrapper functions and the source-dependent wrapper functions. For any HTML tool, only two local functions need to be defined and reimplemented: HTML_makeURL and HTML_translate. Generating wrappers for HTML-based information tools thus becomes a two-function creation process. Eventually, these function creation steps can be automated using a well-designed interface for entering the URL description and a description of how the HTML record is formatted.
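One way to picture this division of labor is a table of sources in which each entry supplies only its two source-dependent functions, while the generic pipeline is written once. The sketch below is an illustration only: %SOURCES, sketch_select, and the fetch callback are hypothetical names that do not appear in DIOM, and the page grab is replaced by a canned response so that the example runs without network access.

```perl
use strict;
use warnings;

# Hypothetical sketch of the two-function idea: the generic pipeline is
# written once, and each source contributes only make_url and translate.
# All names here are assumptions for illustration, not part of DIOM.
my %SOURCES = (
    yahoo => {
        make_url => sub {
            my (@keywords) = @_;
            return 'http://search.yahoo.com/bin/search?p=' . join('+', @keywords);
        },
        translate => sub {
            my ($html) = @_;
            my @rows;
            while ($html =~ m{<li><a\s+href="([^"]+)"}gi) {
                push @rows, $1;
            }
            return @rows;
        },
    },
);

# Generic part: fetch, normalize, select. The fetch step is passed in as a
# code reference so that the sketch does not need a live connection.
sub sketch_select {
    my ($source, $fetch, $filter, @keywords) = @_;
    my $spec = $SOURCES{$source} or die "unknown source: $source\n";

    my $url  = $spec->{make_url}->(@keywords);   # source-dependent
    my $raw  = $fetch->($url);                   # generic
    my @rows = $spec->{translate}->($raw);       # source-dependent
    return grep { $filter->($_) } @rows;         # generic selection
}

# Stand-in for the page grab: returns a canned page for any URL.
my $fake_fetch = sub {
    return qq{<li><a href="http://a.example/jobs">A</a> - one\n}
         . qq{<li><a href="http://b.example/news">B</a> - two\n};
};

my @hits = sketch_select('yahoo', $fake_fetch, sub { $_[0] =~ /jobs/ },
                         qw(job multimedia));
print "$_\n" for @hits;
```

Under this arrangement, adding a wrapper for another HTML tool would amount to adding one more entry to the table, which is the kind of step an entry interface could generate from a URL description and a record format.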