Introduction

 Why PeerCrawl?

 How it Works?

 Challenges


Introduction

The introduction of the web crawler in the mid-90s opened the floodgates for research in various application domains. Many attempts to create an ideal crawler have failed due to the explosive growth of the web. We describe the building blocks of PeerCrawl, a peer-to-peer web crawler. PeerCrawl can be used for generic crawling (crawling the WWW), scales easily, and can be deployed on a grid of everyday desktop computers. Later we show that this crawler can easily be tweaked into a focused crawler, which is very useful for crawling, and especially archiving, particular web domains. Communication among the peers is done via Gnutella, an open P2P protocol.

We have also built a GUI package, PeerCrawl Analyzer, to analyze the statistics of a crawl from the log files the crawler generates. It can be used to study various aspects of a crawl, ranging from the crawling rate to the types of documents crawled.


Why PeerCrawl?

P2P systems have demonstrated their scalability and versatility through numerous applications. Their main advantage is that they can be deployed on a grid of ordinary, everyday-use computers, which makes them attractive for an exhaustive task like web crawling. Creating a web archive can also be done in a distributed fashion, thanks to the inherent nature of P2P systems. In the future, the geographical distribution of peers can be exploited to reduce the latency of fetching pages from web servers.


How it Works?

The architecture of a single peer in PeerCrawl consists of many threads of control that perform various tasks, ranging from fetching a document from a web server to maintaining bookkeeping information. These threads are autonomous, but they may share data structures with one another, which raises various synchronization issues.
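The thread-per-task design with a shared frontier can be sketched as follows. This is a minimal illustration, not PeerCrawl's actual code: the class and method names are assumptions, and a real fetch worker would download pages instead of just recording URLs.

```python
import queue
import threading

class CrawlerPeer:
    """Minimal sketch of one peer: worker threads share a URL frontier."""

    def __init__(self):
        self.frontier = queue.Queue()   # thread-safe queue of URLs to fetch
        self.seen = set()               # bookkeeping: URLs already enqueued
        self.lock = threading.Lock()    # guards the shared 'seen' set
        self.crawled = []               # URLs processed by the workers

    def enqueue(self, url):
        # Synchronization point: several threads may enqueue concurrently.
        with self.lock:
            if url not in self.seen:
                self.seen.add(url)
                self.frontier.put(url)

    def fetch_worker(self):
        # Autonomous worker: drains the frontier until it stays empty.
        while True:
            try:
                url = self.frontier.get(timeout=0.1)
            except queue.Empty:
                return
            self.crawled.append(url)    # a real worker would fetch the page here
            self.frontier.task_done()

    def run(self, seeds, n_threads=4):
        for url in seeds:
            self.enqueue(url)
        threads = [threading.Thread(target=self.fetch_worker)
                   for _ in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return self.crawled
```

Note that the duplicate-check and the queue are the shared state the text refers to; guarding them with a lock (or using inherently thread-safe structures like `queue.Queue`) is what resolves the synchronization issues.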

Typically, the user enters a seed list of URLs to start the crawler. One thread picks up a URL and fetches the corresponding page from its web server. Another thread then parses the page for URLs pointing to other documents (HTML pages, PDFs, etc.). Concurrently, other threads perform tasks such as maintaining statistics and caching pages to disk.
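The parse step described above can be sketched with the standard library alone. This is an illustrative extractor, not PeerCrawl's parser; relative links are resolved against the page's own URL before being handed back to the frontier.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the outgoing links (href attributes) of a fetched page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative URLs against the page's address.
                    self.links.append(urljoin(self.base_url, value))

# Example: a tiny fetched page with one relative and one absolute link.
page = '<a href="/docs/paper.pdf">PDF</a> <a href="http://example.org/">home</a>'
parser = LinkExtractor("http://example.com/index.html")
parser.feed(page)
```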

Initially, the entire address space is allocated to the root node (the startup peer). As peers join or leave the network, the IP address range is divided equally among the current number of peers. The Gnutella layer provides notification of peers joining or leaving the network. Work on coordinating peers to perform focused crawling is currently underway.
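An equal division of the address space can be illustrated as below. This is a simplified sketch, not PeerCrawl's actual scheme: it hashes a URL's host into a fixed space and assigns it to the peer whose equal-sized range contains the hash, so re-dividing on join/leave only means changing the peer count.

```python
import hashlib

SPACE = 2 ** 32  # size of the hash space being divided among peers

def host_hash(host):
    """Map a host name to a point in [0, SPACE)."""
    return int.from_bytes(hashlib.sha1(host.encode()).digest()[:4], "big")

def responsible_peer(host, n_peers):
    """Equal division: peer i owns [i * SPACE/n, (i+1) * SPACE/n)."""
    return host_hash(host) * n_peers // SPACE
```

With one peer (the root node at startup) every host maps to peer 0; as `n_peers` grows, the same formula spreads hosts evenly across the new, smaller ranges.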


Challenges

Web pages nowadays link to varied forms of media such as documents, images, and presentations. One of the biggest challenges is to crawl as many of these document types as possible, an important step toward complete coverage. Other issues, such as speeding up the crawler and honoring schemes like the politeness policy, also pose interesting challenges.
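A politeness policy, in its simplest form, enforces a minimum delay between successive requests to the same host. The sketch below is illustrative (the class name and the one-second default are assumptions, not part of PeerCrawl):

```python
import time

class PolitenessGate:
    """Track per-host access times and enforce a minimum inter-request delay."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.last_access = {}  # host -> timestamp of the last fetch

    def wait_time(self, host, now=None):
        """Seconds a fetcher must wait before contacting 'host' again."""
        now = time.monotonic() if now is None else now
        last = self.last_access.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record(self, host, now=None):
        """Note that 'host' was just fetched."""
        self.last_access[host] = time.monotonic() if now is None else now
```

A fetch thread would call `wait_time` before each request and `record` after it; the tension the text points at is that such waiting directly conflicts with the goal of speeding up the crawler.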
