<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: QDF - WTF&#8230;?</title>
	<atom:link href="http://www.silverspike.co.uk/2007/06/15/qdf-wtf/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.silverspike.co.uk/2007/06/15/qdf-wtf/</link>
	<description>The Official SilverDisc Blog</description>
	<pubDate>Fri, 21 Nov 2008 03:08:25 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.1</generator>
		<item>
		<title>By: alan</title>
		<link>http://www.silverspike.co.uk/2007/06/15/qdf-wtf/#comment-4</link>
		<dc:creator>alan</dc:creator>
		<pubDate>Fri, 13 Jul 2007 11:38:47 +0000</pubDate>
		<guid isPermaLink="false">http://www.silverspike.co.uk/2007/06/15/qdf-wtf/#comment-4</guid>
		<description>&gt; Given that some content doesn’t change, I would argue that it is prudent to not bother downloading such pages very often

True, but that's why the "If-Modified-Since" HTTP request header exists.  The SERPs for any query consist of fresh results, i.e. results that are likely to still exist and be what the robot saw.  That's not to say that the results need to be recently created - just recently verified.

&gt; Maybe, rather than freshness, it is better thought of as ...  “document accuracy”, i.e. the likelihood that the data stored for a document is accurate,

That's exactly what's at issue.</description>
		<content:encoded><![CDATA[<p>> Given that some content doesn’t change, I would argue that it is prudent to not bother downloading such pages very often</p>
<p>True, but that&#8217;s why the &#8220;If-Modified-Since&#8221; HTTP request header exists.  The SERPs for any query consist of fresh results, i.e. results that are likely to still exist and be what the robot saw.  That&#8217;s not to say that the results need to be recently created - just recently verified.</p>
<p>> Maybe, rather than freshness, it is better thought of as &#8230;  “document accuracy”, i.e. the likelihood that the data stored for a document is accurate,</p>
<p>That&#8217;s exactly what&#8217;s at issue.</p>
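<p>As a rough sketch of the "recently verified" idea, assuming a crawler written in Python with only the standard library: a revisit can send the date of the last crawl and treat a 304 reply as confirmation that the stored copy is still accurate. The names build_conditional_request and recrawl are made up for illustration and are not taken from any real crawler.</p>

```python
from datetime import datetime, timezone
from email.utils import format_datetime
import urllib.error
import urllib.request


def build_conditional_request(url, last_crawled):
    """Build a GET request that lets the server skip the body if unchanged.

    last_crawled is assumed to be a timezone-aware datetime.
    """
    # HTTP dates are expressed in GMT, e.g. "Fri, 13 Jul 2007 11:38:47 GMT".
    stamp = format_datetime(last_crawled.astimezone(timezone.utc), usegmt=True)
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def recrawl(url, last_crawled):
    """Return (changed, body); a 304 reply means the indexed copy is verified."""
    request = build_conditional_request(url, last_crawled)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return True, response.read()
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified as an HTTPError.
        if err.code == 304:
            return False, None
        raise
```

<p>A 304 still costs a round trip, but nothing is downloaded or parsed, so the index entry can be marked as verified cheaply. This only helps where the server actually honours the header.</p>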
]]></content:encoded>
	</item>
	<item>
		<title>By: mike</title>
		<link>http://www.silverspike.co.uk/2007/06/15/qdf-wtf/#comment-3</link>
		<dc:creator>mike</dc:creator>
		<pubDate>Sat, 16 Jun 2007 03:33:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.silverspike.co.uk/2007/06/15/qdf-wtf/#comment-3</guid>
		<description>“Query deserves freshness”? I thought that all queries deserved freshness! A stale index results in poor quality SERPs that could lead to lots of pages that no longer exist or have been changed since they were indexed. A fresh index solves this problem to a great extent.
I disagree that all queries deserve the "Last crawled date" freshness, because it pits two goals (fresh index and effective utilisation of crawl resources) against each other.

Some content will not change much. Take the W3C HTML standards:
http://www.w3.org/TR/html401/

Given that some content doesn't change, I would argue that it is prudent to not bother downloading such pages very often. In terms of bandwidth and useful utilisation of a parser (which, with ever increasing elements to uncover, is probably taking more CPU cycles than ever), the resources spent grabbing an unchanging page are better spent grabbing either new content for the index, or refreshing content that changes regularly, like the front page of Digg or news sites.

Reconciling an algorithm that optimises the utilisation of crawl resources, and so likely discriminates against primary sources that never change, with the use of "last crawled date" in rankings would be problematic, especially when the searcher clearly wants the primary source.

Sure, you could make the last crawl date for such pages constantly NOW, but that makes the ability to optimise crawling schedules (which would be as close to 1 second after publishing as possible :)) potentially harmful to results. Optimising in two directions (crawl efficiency and relevance) is terribly difficult.

Maybe, rather than freshness, it is better thought of as "staleness" or "document accuracy", i.e. the likelihood that the data stored for a document is accurate, with staleness being the moment at which the indexed copy can no longer be trusted as an accurate reflection of the document.

Freshness, per se, doesn't really matter, but accuracy does. Such a measure would also allow different documents to lose trust in the accuracy of the indexed copy at different rates (a news site's home page, for example, would likely be inaccurate after 10-15 minutes; a site with every legal judgement likely has a much longer decay cycle).</description>
		<content:encoded><![CDATA[<p>“Query deserves freshness”? I thought that all queries deserved freshness! A stale index results in poor quality SERPs that could lead to lots of pages that no longer exist or have been changed since they were indexed. A fresh index solves this problem to a great extent.<br />
I disagree that all queries deserve the &#8220;Last crawled date&#8221; freshness, because it pits two goals (fresh index and effective utilisation of crawl resources) against each other.</p>
<p>Some content will not change much. Take the W3C HTML standards:<br />
<a href="http://www.w3.org/TR/html401/" rel="nofollow">http://www.w3.org/TR/html401/</a></p>
<p>Given that some content doesn&#8217;t change, I would argue that it is prudent to not bother downloading such pages very often. In terms of bandwidth and useful utilisation of a parser (which, with ever increasing elements to uncover, is probably taking more CPU cycles than ever), the resources spent grabbing an unchanging page are better spent grabbing either new content for the index, or refreshing content that changes regularly, like the front page of Digg or news sites.</p>
<p>Reconciling an algorithm that optimises the utilisation of crawl resources, and so likely discriminates against primary sources that never change, with the use of &#8220;last crawled date&#8221; in rankings would be problematic, especially when the searcher clearly wants the primary source.</p>
<p>Sure, you could make the last crawl date for such pages constantly NOW, but that makes the ability to optimise crawling schedules (which would be as close to 1 second after publishing as possible :)) potentially harmful to results. Optimising in two directions (crawl efficiency and relevance) is terribly difficult.</p>
<p>Maybe, rather than freshness, it is better thought of as &#8220;staleness&#8221; or &#8220;document accuracy&#8221;, i.e. the likelihood that the data stored for a document is accurate, with staleness being the moment at which the indexed copy can no longer be trusted as an accurate reflection of the document.</p>
<p>Freshness, per se, doesn&#8217;t really matter, but accuracy does. Such a measure would also allow different documents to lose trust in the accuracy of the indexed copy at different rates (a news site&#8217;s home page, for example, would likely be inaccurate after 10-15 minutes; a site with every legal judgement likely has a much longer decay cycle).</p>
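<p>A minimal sketch of what such a measure might look like, in Python. The exponential decay and the per-document half_life are my own assumptions, picked only to illustrate different documents losing trust at different rates; IndexedDoc, accuracy and crawl_order are hypothetical names, and http://news.example/ is a placeholder URL.</p>

```python
from dataclasses import dataclass


@dataclass
class IndexedDoc:
    url: str
    last_verified: float  # seconds since epoch when the copy was last checked
    half_life: float      # assumed: seconds until trust in the copy halves


def accuracy(doc, now):
    """Estimated likelihood that the indexed copy still matches the page."""
    age = max(0.0, now - doc.last_verified)
    return 0.5 ** (age / doc.half_life)


def crawl_order(docs, now):
    """Sort documents so the least trusted copies are recrawled first."""
    return sorted(docs, key=lambda doc: accuracy(doc, now))


# A news front page decays in minutes, a published standard over years.
news = IndexedDoc("http://news.example/", 0.0, 10 * 60)
spec = IndexedDoc("http://www.w3.org/TR/html401/", 0.0, 10 * 365 * 24 * 3600)
queue = crawl_order([spec, news], now=3600.0)
```

<p>With these made-up numbers the news page comes first in the queue after an hour, while the spec is left alone, which is the split of crawl resources argued for above.</p>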
]]></content:encoded>
	</item>
</channel>
</rss>
