The web is highly dynamic, and crawling even a fraction of it can take weeks or months. Many events, such as creations, updates and deletions, occur while a crawl is in progress. From a search engine's point of view, there is a cost associated with not detecting an event and therefore holding an outdated copy of a resource. Freshness and age are the most commonly used cost functions.
Freshness: A binary measure that indicates whether the local copy is accurate.
The freshness of a page S in the repository at time t is defined as:
$$F_S(t) = \begin{cases} 1 & \text{if } S \text{ is equal to the local copy at time } t \\ 0 & \text{otherwise} \end{cases}$$
Age: A measure that indicates how outdated the local copy is.
The age of a page k in the repository at time b is defined as:
$$A_k(b) = \begin{cases} 0 & \text{if } k \text{ is not modified at time } b \\ b - \text{modification time of } k & \text{otherwise} \end{cases}$$
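The two definitions above can be sketched in a few lines of Python. This is only an illustration: the `PageState` class and the function names are hypothetical, and the sketch assumes that "modification time" means the earliest modification the local copy has not yet captured.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageState:
    """Hypothetical record of what the crawler knows about one page."""
    last_crawl: float            # time the local copy was fetched
    modifications: List[float]   # times the live page changed, in ascending order


def first_unseen_modification(page: PageState, t: float) -> Optional[float]:
    """Earliest modification after the last crawl and no later than t, if any."""
    for m in page.modifications:
        if page.last_crawl < m <= t:
            return m
    return None


def freshness(page: PageState, t: float) -> int:
    """1 if the local copy still equals the live page at time t, else 0."""
    return 1 if first_unseen_modification(page, t) is None else 0


def age(page: PageState, t: float) -> float:
    """0 while the copy is up to date, otherwise time elapsed since it went stale."""
    m = first_unseen_modification(page, t)
    return 0.0 if m is None else t - m
```

For a page crawled at time 10 and modified at times 12 and 15, the copy is fresh at time 11, and from time 12 onward its age grows linearly until the next crawl.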
Evolution of freshness and age in web crawling:
Edward G. Coffman and colleagues defined the objective of a web crawler in different wording that is equivalent to freshness: a crawler must minimize the fraction of time that pages remain outdated. They also observed that web crawling can be modeled as a multiple-queue, single-server polling system, in which the web crawler is the server and the websites are the queues.
Page modifications correspond to the arrival of customers, and switchover times correspond to the interval between page accesses to a single website. Under this model, the mean waiting time for a customer in the polling system is equivalent to the average age for the web crawler.
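A rough simulation can make the polling analogy concrete. The sketch below is not taken from the original analysis; it assumes a round-robin visiting order, a fixed switchover time, and Poisson page modifications with the same rate at every site. The function name `mean_wait` and all parameter values are hypothetical.

```python
import math
import random


def mean_wait(num_sites: int, change_rate: float, switchover: float,
              cycles: int, seed: int = 0) -> float:
    """Average time a modification (customer) waits for the crawler (server)."""
    rng = random.Random(seed)
    cycle_time = num_sites * switchover   # time between two visits to the same site
    horizon = cycles * cycle_time
    total_wait = 0.0
    count = 0
    for site in range(num_sites):
        first_visit = site * switchover   # round-robin: site i is first served at i * switchover
        t = rng.expovariate(change_rate)  # time of the first modification at this site
        while t < horizon:
            # Index of the first visit to this site at or after time t.
            k = max(0, math.ceil((t - first_visit) / cycle_time))
            next_visit = first_visit + k * cycle_time
            total_wait += next_visit - t
            count += 1
            t += rng.expovariate(change_rate)  # next Poisson arrival
    return total_wait / count if count else 0.0


if __name__ == "__main__":
    print(mean_wait(num_sites=5, change_rate=0.5, switchover=2.0, cycles=2000))
```

Under these assumptions a modification arrives at a uniformly random point in the visiting cycle, so the mean wait comes out close to half the cycle time.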
In 2003, Cho and Garcia-Molina studied two simple re-visiting policies:
a) Uniform policy: All pages in the collection are re-visited with the same frequency, irrespective of their rates of change.
b) Proportional policy: Pages that change more frequently are re-visited more often, with a visiting frequency directly proportional to the estimated change frequency.
Cho and Garcia-Molina reported a surprising result: in terms of average freshness, the uniform policy outperforms the proportional policy in both a simulated web and a real web crawl. The explanation is that a crawler can fetch only a limited number of pages in a given time, so a page that changes too frequently wastes the crawler's time: the crawler re-crawls it again and again, yet its copy still goes stale almost immediately.
To improve freshness, they concluded that the crawler should penalize the elements that change too often.
The optimal re-visiting policy is therefore neither the uniform policy nor the proportional policy.
To keep average freshness high, the optimal policy ignores the pages that change too often.
Even so, the optimal policy is closer to the uniform policy than to the proportional policy. Note that the re-visiting policies considered here treat all pages as homogeneous in terms of quality.
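The comparison between the uniform and proportional policies can be checked numerically. The sketch below rests on a modeling assumption that is not stated in this article: each page changes as a Poisson process and is re-visited at regular intervals, in which case its time-averaged freshness is (1 - e^(-r)) / r, where r is the change rate divided by the visit rate. The change rates and the crawl budget are made-up illustrative values.

```python
import math
from typing import List


def expected_freshness(change_rate: float, visit_rate: float) -> float:
    """Time-averaged freshness of one page under the Poisson-change assumption."""
    if visit_rate <= 0:
        return 0.0          # never re-visited: treated as always stale
    if change_rate == 0:
        return 1.0          # never changes: always fresh
    r = change_rate / visit_rate
    return (1.0 - math.exp(-r)) / r


def average_freshness(change_rates: List[float], visit_rates: List[float]) -> float:
    """Mean freshness over the whole collection."""
    pairs = zip(change_rates, visit_rates)
    return sum(expected_freshness(c, v) for c, v in pairs) / len(change_rates)


def uniform_policy(change_rates: List[float], budget: float) -> List[float]:
    """Every page gets the same share of the crawl budget."""
    return [budget / len(change_rates)] * len(change_rates)


def proportional_policy(change_rates: List[float], budget: float) -> List[float]:
    """Each page's share of the budget is proportional to its change rate."""
    total = sum(change_rates)
    return [budget * c / total for c in change_rates]


if __name__ == "__main__":
    rates = [9.0, 1.0, 1.0, 1.0, 1.0]   # hypothetical changes per day
    budget = 5.0                        # hypothetical total visits per day
    print("uniform:     ", average_freshness(rates, uniform_policy(rates, budget)))
    print("proportional:", average_freshness(rates, proportional_policy(rates, budget)))
```

With these numbers the proportional policy spends most of its budget on the one fast-changing page, which stays stale most of the time anyway, while the four slow pages are visited less often than they would be under the uniform policy.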
Politeness Policy: Koster noted that web robots are useful for a number of tasks, but that their use comes at a price for the general community.
The costs include:
a) Network resources: Robots require considerable bandwidth.
b) Server overload: This occurs especially when the frequency of accesses to a given server is too high.
c) Poorly written robots: These can crash servers or routers.
d) Disrupted networks and web servers: This can happen if too many users deploy personal robots.
The robots.txt protocol (the robots exclusion protocol) is a partial solution to the above-mentioned problems: it lets administrators indicate which parts of their web servers crawlers should not access.
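As a minimal sketch of how a crawler might honor robots.txt, the example below uses `urllib.robotparser` from the Python standard library. The robots.txt content, the `ExampleBot` user agent, and the example.com URLs are made up for illustration; a real crawler would download the file from the target site rather than parse an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether the (hypothetical) bot may fetch each URL.
allowed_public = parser.can_fetch("ExampleBot", "https://example.com/index.html")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/data.html")

# Seconds the site asks crawlers to wait between requests, or None if unspecified.
delay = parser.crawl_delay("ExampleBot")

print(allowed_public, allowed_private, delay)
```

Checking `can_fetch` before every request and waiting for the crawl delay between requests to the same host addresses the server-overload cost listed above, although compliance remains voluntary on the crawler's side.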