A parallel crawler is a crawler that runs multiple processes in parallel.
Its goal is to maximize the download rate while minimizing the overhead of parallelization and avoiding repeated downloads of the same page.
Because the same URL can be discovered by two different crawling processes, the crawling system needs a policy for assigning the new URLs found during the crawl. Such a policy can be dynamic or static.
Dynamic assignment: A central server assigns new URLs to the different crawlers dynamically, which lets the server balance the load of each crawler.
With dynamic assignment, the system can also add or remove downloader processes. Because the central server can become the bottleneck, most of the workload has to be transferred to the distributed crawling processes.
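The idea of dynamic assignment can be sketched as follows. This is only an illustration, not the design of any particular crawler: the `CentralAssigner` class, its method names, and the "least loaded crawler first" rule are assumptions made for the example.

```python
from typing import Dict, List, Optional, Set


class CentralAssigner:
    """Hypothetical central server: hands each new URL to the least-loaded crawler."""

    def __init__(self, crawler_ids: List[str]) -> None:
        self.seen: Set[str] = set()  # URLs that were already assigned
        self.load: Dict[str, int] = {cid: 0 for cid in crawler_ids}

    def add_crawler(self, crawler_id: str) -> None:
        # A new downloader process can join at any time.
        self.load.setdefault(crawler_id, 0)

    def remove_crawler(self, crawler_id: str) -> None:
        # Re-queuing the removed crawler's pending URLs is omitted in this sketch.
        self.load.pop(crawler_id, None)

    def assign(self, url: str) -> Optional[str]:
        """Return the crawler chosen for `url`, or None if the URL is a duplicate."""
        if url in self.seen:
            return None
        if not self.load:
            raise RuntimeError("no crawlers registered")
        self.seen.add(url)
        # Pick the crawler with the fewest assigned URLs; ties go to the smallest id.
        target = min(self.load, key=lambda cid: (self.load[cid], cid))
        self.load[target] += 1
        return target
```

Every URL passes through the single `assign` call here, which is exactly why the central server tends to become the bottleneck as the crawl grows.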
There are two main configurations of crawling architectures with dynamic assignment, described by Shkapenyuk and Suel (2002):
a) Small crawler configuration: there is a central DNS resolver, central queues per website, and distributed downloaders.
b) Large crawler configuration: both the DNS resolver and the queues are distributed.
Static assignment: In this type of policy, a fixed rule defines how new URLs are assigned to the crawlers.
For example, a hashing function can transform a URL into a number that corresponds to the index of the crawling process responsible for it.
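A minimal sketch of such a rule is shown below. The function name `assign_process` is hypothetical, and hashing only the host part of the URL (so that all pages of one site go to the same process) is an assumption of this example; the text above only requires that a URL be mapped to a process index.

```python
import hashlib
from urllib.parse import urlsplit


def assign_process(url: str, num_processes: int) -> int:
    """Map a URL to the index of the crawling process responsible for it."""
    # Assumption: hash the host name rather than the full URL.
    host = urlsplit(url).hostname or ""
    # hashlib gives the same digest in every process and on every run,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_processes
```

Because the rule is fixed and deterministic, every process can compute the owner of a URL locally, without asking a central server.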
Because a process may discover URLs that belong to another process, the crawling processes have to exchange URLs. To reduce the overhead of this exchange, the URLs should be sent in batches, several at a time.
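Batched exchange could look roughly like the sketch below. The `UrlExchangeBuffer` class, the `owner` and `send` callbacks, and the batch size are all assumptions made for illustration; in a real system `send` would be a network call and `owner` would be the static assignment rule.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class UrlExchangeBuffer:
    """Hypothetical per-process buffer that forwards foreign URLs in batches."""

    def __init__(
        self,
        my_index: int,
        batch_size: int,
        owner: Callable[[str], int],             # static rule: URL -> process index
        send: Callable[[int, List[str]], None],  # transport: (destination, batch)
    ) -> None:
        self.my_index = my_index
        self.batch_size = batch_size
        self.owner = owner
        self.send = send
        self.outbox: DefaultDict[int, List[str]] = defaultdict(list)

    def add(self, url: str) -> bool:
        """Return True if this process should crawl `url` itself."""
        dest = self.owner(url)
        if dest == self.my_index:
            return True
        self.outbox[dest].append(url)
        # One message per full batch instead of one message per URL.
        if len(self.outbox[dest]) >= self.batch_size:
            self.flush(dest)
        return False

    def flush(self, dest: int) -> None:
        batch = self.outbox.pop(dest, [])
        if batch:
            self.send(dest, batch)

    def flush_all(self) -> None:
        # Send whatever is left, for example at shutdown or on a timer.
        for dest in list(self.outbox):
            self.flush(dest)
```

A larger batch size means fewer messages but a longer delay before the owning process learns about a URL, so the value is a tuning choice.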
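An effective assignment function has three main properties, as described by Boldi et al.: the balancing property (each crawling process gets approximately the same number of hosts), the contra-variance property (if the number of crawling processes grows, the number of hosts assigned to each process shrinks), and the ability to add and remove crawling processes dynamically.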
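The first two properties can be checked empirically for a simple hash-based rule. The sketch below is an assumption-laden illustration: the helper names and the synthetic host names are made up, and hash-modulo assignment is used only because it is easy to inspect, not because the source prescribes it.

```python
import hashlib
from typing import Dict, List


def host_to_process(host: str, num_processes: int) -> int:
    """Hypothetical static rule: stable hash of the host, modulo process count."""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_processes


def hosts_per_process(hosts: List[str], num_processes: int) -> Dict[int, int]:
    """Count how many hosts each process would receive."""
    counts = {i: 0 for i in range(num_processes)}
    for host in hosts:
        counts[host_to_process(host, num_processes)] += 1
    return counts


# Synthetic host names, used only for this demonstration.
hosts = ["site%d.example" % i for i in range(1000)]
for n in (4, 8):
    counts = hosts_per_process(hosts, n)
    # Balancing: min and max should be close. Contra-variance: both drop as n grows.
    print(n, min(counts.values()), max(counts.values()))
```

Note that this simple rule handles the third property poorly: changing the number of processes changes the modulus, so most hosts move to a different process.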
