A web crawler is a program that browses the World Wide Web in a methodical, automated manner. A web crawler is also commonly known as a web spider or ant.
Web crawlers create copies of the pages they visit. These copies are then used by search engines, which index the downloaded pages to speed up the search process.
Web crawlers are also useful for automating maintenance tasks on a website, such as checking links or validating HTML code. They can also gather information from web pages.
A web crawler is one type of bot, or software agent. The process starts with a list of URLs to visit. As the crawler visits each listed URL, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit.
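The loop described above can be sketched in Python as follows. This is a minimal illustration, not a production crawler: the names `LinkExtractor`, `crawl`, `fetch` and `max_pages` are assumptions made for the example, and `fetch` is a hypothetical caller-supplied function that returns a page's HTML (or None), so the sketch can be run without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    `fetch` is a hypothetical caller-supplied function mapping a URL to
    its HTML text, or None if the page cannot be retrieved.
    """
    frontier = deque(seeds)   # URLs still to visit
    seen = set(seeds)         # URLs already queued, to avoid repeats
    visited = []              # URLs in the order they were fetched

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            absolute, _ = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited
```

In a real crawler, `fetch` would perform an HTTP request; here it can be as simple as the `get` method of a dictionary that maps URLs to HTML strings.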
Crawling policies:
The process of web crawling becomes difficult due to two notable characteristics of the web.
• Its large volume
• Its rate of change
Because large numbers of pages are constantly being added, removed and changed each day, web crawling becomes really difficult.
The large volume: The web is so large that a crawler can download only a small fraction of its pages within a given time. This makes it necessary for the web crawler to prioritize its downloads.
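One way to prioritize downloads is to keep the URLs waiting to be visited in a priority queue. The sketch below is only an illustration: the class name `PriorityFrontier` and the idea of a numeric `score` per URL are assumptions made for the example, and how a real crawler computes such a score is not covered here.

```python
import heapq
import itertools


class PriorityFrontier:
    """URL queue that hands out the highest-scoring URL first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: insertion order
        self._queued = set()               # URLs ever added

    def add(self, url, score):
        """Queue a URL once; later duplicates are ignored."""
        if url in self._queued:
            return
        self._queued.add(url)
        # heapq is a min-heap, so negate the score to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        """Remove and return the URL with the highest score."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```

With such a queue, the crawler always downloads the page it currently considers most important, so a limited time budget is spent on the highest-scoring pages first.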
Rate of change: The web changes so quickly that by the time the crawler is downloading the last pages from a site, new pages may already have been added to that site, or pages it downloaded earlier may have been changed or removed.
A combination of policies determines the behavior of a web crawler. The policies are as follows.
• A selection policy: defines which pages to download.
• A re-visit policy: states when to check for changes to the pages.
• A politeness policy: states how to avoid overloading websites.
• A parallelization policy: states how to coordinate distributed web crawlers.
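As an illustration of the politeness policy, the sketch below shows two common techniques: waiting a minimum time between requests to the same host, and honoring a site's robots.txt rules. The names `PolitenessGate` and `allowed_by_robots`, the one-second default delay and the `ExampleBot` user agent are assumptions made for the example; `RobotFileParser` comes from Python's standard library.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PolitenessGate:
    """Tracks the last request per host to enforce a minimum delay."""

    def __init__(self, delay_seconds=1.0, clock=time.monotonic):
        self.delay = delay_seconds
        self.clock = clock            # injectable so the gate can be tested
        self._last_request = {}       # host -> time of the last request

    def wait_time(self, url):
        """Seconds the crawler should still wait before requesting `url`."""
        host = urlparse(url).netloc
        last = self._last_request.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.delay - (self.clock() - last))

    def record(self, url):
        """Note that a request to the URL's host was just made."""
        self._last_request[urlparse(url).netloc] = self.clock()


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A crawler would call `wait_time` before each download, sleep for the returned number of seconds, and call `record` once the request has been sent. For example, `allowed_by_robots("User-agent: *\nDisallow: /private/", "ExampleBot", "http://example.com/private/x")` evaluates to False.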