<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>SEO Notes &#187; Web Crawlers</title>
	<atom:link href="http://www.seonotes.com/category/web-crawlers/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.seonotes.com</link>
	<description>Search Engine Optimization notes for Non-Techies!</description>
	<lastBuildDate>Mon, 09 Nov 2009 17:32:46 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.5</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Using Google&#8217;s Webmaster Tools</title>
		<link>http://www.seonotes.com/using-googles-webmaster-tools/</link>
		<comments>http://www.seonotes.com/using-googles-webmaster-tools/#comments</comments>
		<pubDate>Tue, 16 Jun 2009 18:31:30 +0000</pubDate>
		<dc:creator>Bhagwad</dc:creator>
				<category><![CDATA[Search Engines]]></category>
		<category><![CDATA[Web Crawlers]]></category>
		<category><![CDATA[search engine optimization]]></category>
		<category><![CDATA[seo notes]]></category>
		<category><![CDATA[seo tips]]></category>

		<guid isPermaLink="false">http://seonotes.com/?p=154</guid>
		<description><![CDATA[No professional SEO expert can afford to do without Google&#8217;s Webmaster tools. With Google being the dominant search engine that netizens use to find your sites, Google&#8217;s webmaster tools allow you to track how often and how much of your website the Google spiders are downloading.
Also, you can maintain a &#8220;sitemap&#8221; of your website that [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">No professional SEO expert can afford to do without Google&#8217;s Webmaster tools. With Google being the dominant search engine that netizens use to find your sites, Google&#8217;s webmaster tools allow you to track how often and how much of your website the Google spiders are downloading.</p>
<p style="text-align: justify;">Also, you can maintain a &#8220;sitemap&#8221; of your website that you submit to Google. The sitemap allows the spiders to reach even those URLs that are hidden away and are not easily accessible to spiders. In addition, you can &#8220;alert&#8221; Google in a variety of ways when you add or change content. You can remove outdated pages and prevent them from showing in Google&#8217;s index.</p>
<p style="text-align: justify;">The tools also let you see whether there are crawling errors when Google visits your site. Page titles and meta descriptions are within their scope as well. Finally, they let you see which keywords were detected on your site and how your site ranks for various phrases &#8211; though I prefer Google Analytics for that &#8211; more on that later.</p>
<p style="text-align: justify;">For Google Webmaster Tools to work, you have to own the site or at least have read/write access to the files since it requires you to verify the site with either a meta tag or a specific HTML file.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.seonotes.com/using-googles-webmaster-tools/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Web Crawler &#8211; Parallelization Policy</title>
		<link>http://www.seonotes.com/web-crawler-parallelization-policy/</link>
		<comments>http://www.seonotes.com/web-crawler-parallelization-policy/#comments</comments>
		<pubDate>Wed, 19 Apr 2006 09:14:38 +0000</pubDate>
		<dc:creator>seonotes</dc:creator>
				<category><![CDATA[Web Crawlers]]></category>
		<category><![CDATA[seo notes]]></category>

		<guid isPermaLink="false">http://seonotes.com/?p=24</guid>
		<description><![CDATA[A crawler that runs multiple processes in parallel is a parallel crawler.
The goal is to maximize the download rate while minimizing the overhead from parallelization and avoiding repeated downloads of the same page.
To avoid downloading the same page twice, the crawling system requires a policy for assigning the new URLs discovered [...]]]></description>
			<content:encoded><![CDATA[<p>A crawler that runs multiple processes in parallel is a parallel crawler.</p>
<p>The goal is to maximize the download rate while minimizing the overhead from parallelization and avoiding repeated downloads of the same page.</p>
<p>To avoid downloading the same page twice, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, because the same URL can be found by two different crawling processes.</p>
<p>Dynamic assignment: With this type of policy, a central server assigns new URLs to the different crawlers dynamically. This allows the central server to balance the load of each crawler.</p>
<p>With dynamic assignment, the system can also add or remove downloader processes. Because the central server can become the bottleneck, most of the workload must be transferred to the distributed crawling processes.</p>
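<p>Dynamic assignment can be pictured with a minimal Python sketch. The class name <code>CentralAssigner</code>, its methods, and the rule "give the next URL to the crawler with the fewest URLs assigned so far" are assumptions made for illustration; a real central server would track actual crawler load.</p>

```python
import heapq


class CentralAssigner:
    """Hypothetical central server that gives each new URL to the least-loaded crawler."""

    def __init__(self, crawler_ids):
        # Heap of (urls_assigned_so_far, crawler_id); the smallest load is on top.
        self._heap = [(0, cid) for cid in sorted(crawler_ids)]
        heapq.heapify(self._heap)
        self.assigned = {cid: [] for cid in crawler_ids}
        self._seen = set()

    def add_crawler(self, crawler_id):
        """Register a new downloader process while the crawl is running."""
        self.assigned[crawler_id] = []
        heapq.heappush(self._heap, (0, crawler_id))

    def assign(self, url):
        """Assign url to the least-loaded crawler; return None if it was seen before."""
        if url in self._seen:
            return None
        self._seen.add(url)
        load, cid = heapq.heappop(self._heap)
        self.assigned[cid].append(url)
        heapq.heappush(self._heap, (load + 1, cid))
        return cid


server = CentralAssigner(["c0", "c1"])
for url in ["http://a.example/1", "http://b.example/1", "http://a.example/1"]:
    print(url, server.assign(url))
```

<p>The repeated URL is not handed out a second time, which is the point of having an assignment policy in the first place.</p>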
<p>There are two main configurations of crawling architectures with dynamic assignment.</p>
<p>These configurations were described by Shkapenyuk and Suel in 2002.</p>
<p>a) A small crawler configuration: There is a central DNS resolver and central per-website queues, with distributed downloaders.<br />
b) A large crawler configuration: Both the DNS resolver and the queues are distributed.</p>
<p>Static assignment: With this type of policy, a fixed rule defines how to assign new URLs to the crawlers.</p>
<p>For static assignment, a hash function can be used to transform a URL into a number that corresponds to the index of the crawling process responsible for it.</p>
<p>Because a crawling process will discover URLs that belong to other processes, URLs have to be exchanged between crawling processes. Exchanging them in batches reduces the overhead of this exchange.</p>
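<p>A rough Python sketch of static assignment with batched exchange follows. The function names are hypothetical, and hashing the host name rather than the full URL is an assumption made here so that all pages of one website go to the same crawling process.</p>

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def owner_index(url, num_crawlers):
    """Map a URL to a crawler index by hashing its host name (an assumed choice)."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


def batch_by_owner(urls, num_crawlers):
    """Group discovered URLs by owning crawler so each group is sent as one batch."""
    batches = defaultdict(list)
    for url in urls:
        batches[owner_index(url, num_crawlers)].append(url)
    return dict(batches)


discovered = [
    "http://www.example.com/a",
    "http://www.example.com/b",
    "http://www.example.org/",
]
for crawler, batch in sorted(batch_by_owner(discovered, 4).items()):
    print(crawler, batch)
```

<p>A stable hash such as SHA-256 is used instead of Python's built-in <code>hash()</code>, because every crawling process must compute the same index for the same host.</p>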
<p>According to Boldi et al., an effective assignment function has three main properties: a balancing property, a contra-variance property, and the ability to add and remove crawling processes dynamically.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.seonotes.com/web-crawler-parallelization-policy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Crawler Revisit Policy</title>
		<link>http://www.seonotes.com/crawler-revisit-policy/</link>
		<comments>http://www.seonotes.com/crawler-revisit-policy/#comments</comments>
		<pubDate>Wed, 19 Apr 2006 09:12:33 +0000</pubDate>
		<dc:creator>seonotes</dc:creator>
				<category><![CDATA[Web Crawlers]]></category>
		<category><![CDATA[seo notes]]></category>

		<guid isPermaLink="false">http://seonotes.com/?p=23</guid>
		<description><![CDATA[The web is highly dynamic, and crawling it can take a long time (weeks or months). Many events, such as creations, updates and deletions, occur while a crawl is in progress. From the search engine&#8217;s point of view, there is a cost associated with not detecting an event. Freshness and age [...]]]></description>
			<content:encoded><![CDATA[<p>The web is highly dynamic, and crawling it can take a long time (weeks or months). Many events, such as creations, updates and deletions, occur while a crawl is in progress. From the search engine&#8217;s point of view, there is a cost associated with not detecting an event, because the local copy of the page is then outdated. Freshness and age are the most commonly used cost functions.</p>
<p>Freshness: This is a binary measure that indicates whether the local copy is accurate.</p>
<p>The formula depicted below defines the freshness of a page &#8216;S&#8217; in the repository at time &#8216;t&#8217;.</p>
<p>F<sub>S</sub>(t) = 1 if S is equal to the local copy at time t,<br />
F<sub>S</sub>(t) = 0 otherwise.</p>
<p>Age: This indicates how outdated the local copy is.</p>
<p>The age of page &#8216;k&#8217; in the repository at time &#8216;b&#8217; is defined by the formula given below.</p>
<p>A<sub>k</sub>(b) = 0 if k is not modified at time b,<br />
A<sub>k</sub>(b) = b &#8211; modification time of k otherwise.</p>
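<p>The two cost functions can be written as small Python functions. This is only a sketch: times are assumed to be plain numbers (for example seconds), and "modification time" is read here as the earliest change that the local copy does not yet reflect, which is an interpretation rather than something the formulas above spell out.</p>

```python
def freshness(live_content, local_content):
    """Return 1 if the local copy equals the live page right now, otherwise 0."""
    return 1 if live_content == local_content else 0


def age(now, last_sync, modification_times):
    """Return 0 if the page did not change after last_sync; otherwise return
    now minus the earliest modification the local copy has not picked up."""
    pending = [m for m in modification_times if m > last_sync and now >= m]
    return now - min(pending) if pending else 0


print(freshness("hello", "hello"))   # 1
print(age(100, 50, [10, 60, 80]))    # 40
print(age(100, 50, [10]))            # 0
```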
<p>Evolution of freshness and age in web crawling:</p>
<p>Edward G. Coffman and colleagues defined the objective of a web crawler in a way that is equivalent to freshness but worded differently: a crawler must minimize the fraction of time that pages remain outdated. They also observed that the problem of web crawling can be modeled as a multiple-queue, single-server polling system, in which the web crawler is the server and the websites are the queues.</p>
<p>In this model, page modifications are the arrivals of customers, and switch-over times are the intervals between page accesses to a single website. The mean waiting time for a customer in the polling system is equivalent to the average age for the web crawler.</p>
<p>In 2003, Cho and Garcia-Molina studied two simple re-visiting policies:</p>
<p>a)	Uniform policy: All pages in the collection are revisited with the same frequency, irrespective of their rates of change.<br />
b)	Proportional policy: Pages that change more frequently are revisited more often, so the visiting frequency is proportional to the change frequency.</p>
<p>Cho and Garcia-Molina came up with a surprising result: the uniform policy outperforms the proportional policy in terms of average freshness, both on a simulated web and in a real web crawl. The explanation is that when a page changes too often, the crawler wastes time trying to re-crawl it too fast and still cannot keep its copy fresh.</p>
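<p>The comparison can be reproduced on a toy example. The sketch below assumes that each page changes as a Poisson process with a known rate and is revisited at regular intervals; under that assumption the time-averaged freshness of a page is (1 &#8211; e<sup>&#8211;r</sup>) / r, where r is the change rate divided by the visit rate. The change rates and the visit budget are made-up numbers, and the function names are hypothetical.</p>

```python
import math


def expected_freshness(change_rate, visit_rate):
    """Time-averaged freshness of one page under the assumed Poisson change model."""
    if change_rate == 0:
        return 1.0
    r = change_rate / visit_rate
    return (1 - math.exp(-r)) / r


def uniform_policy(change_rates, budget):
    """Every page gets the same share of the visit budget."""
    return [budget / len(change_rates) for _ in change_rates]


def proportional_policy(change_rates, budget):
    """Each page gets a share of the budget proportional to its change rate."""
    total = sum(change_rates)
    return [budget * rate / total for rate in change_rates]


def average_freshness(change_rates, visit_rates):
    values = [expected_freshness(c, v) for c, v in zip(change_rates, visit_rates)]
    return sum(values) / len(values)


change_rates = [9.0, 1.0]  # hypothetical changes per day for two pages
budget = 2.0               # hypothetical total visits per day

uniform = average_freshness(change_rates, uniform_policy(change_rates, budget))
proportional = average_freshness(change_rates, proportional_policy(change_rates, budget))
print(round(uniform, 3), round(proportional, 3))  # 0.372 0.199
```

<p>With these numbers the proportional policy spends most of its visits on the page that changes nine times a day and still cannot keep it fresh, while the slowly changing page goes stale in the meantime.</p>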
<p>To improve freshness, the crawler should therefore penalize the elements that change too often.</p>
<p>The optimal re-visiting policy is neither the uniform policy nor the proportional policy.</p>
<p>The best way to keep average freshness high includes ignoring the pages that change too often.</p>
<p>In this case the optimal policy is closer to the uniform policy than to the proportional one. Note that the re-visiting policies considered here treat all pages as homogeneous in terms of quality.</p>
<p>Politeness Policy: Koster noted that web robots are useful for a number of tasks, but that their use comes at a price for the general community.</p>
<p>The costs include:</p>
<p>a)	Network resources: Robots require considerable bandwidth.<br />
b)	Server overload: This occurs when the frequency of accesses to a given server is too high.<br />
c)	Poorly written robots: These can crash servers or routers.<br />
d)	Disrupted networks and web servers: This can happen if too many users deploy personal robots.</p>
<p>The robots.txt protocol is a partial solution to these problems: it lets site administrators tell crawlers which parts of a web server they should not access.</p>
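<p>As an illustration, a polite crawler can check robots.txt before fetching a page. The sketch below uses Python's standard <code>urllib.robotparser</code> module; the robots.txt content, the bot name <code>ExampleBot</code> and the example.com URLs are made up for the example, and <code>Crawl-delay</code> is a non-standard directive that not every crawler honours.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
print(parser.can_fetch("ExampleBot", "http://www.example.com/index.html"))         # True
print(parser.can_fetch("ExampleBot", "http://www.example.com/private/data.html"))  # False
print(parser.crawl_delay("ExampleBot"))                                            # 10
```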
]]></content:encoded>
			<wfw:commentRss>http://www.seonotes.com/crawler-revisit-policy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

