The Silver Spike | The Official SilverDisc Blog

June 15, 2007

QDF – WTF…?

Search engine freshness is an issue that’s always been of interest to me, certainly since the very early days of my search marketing adventures in the mid-1990s. It interested me so much that it lies at the core of one of my patents, “Process for maintaining ongoing registration for pages on a given search engine”.

I would define search engine freshness as follows:

freshness
An indication of how recently a search engine has crawled and indexed a page, a part of its index or its index as a whole

I think most search engine engineers would define it in a similar way. For example, Matt Cutts, in his blog post “Measuring Freshness” from September 2005, stated:

The authors tracked Google, Yahoo, and MSN over 42 days using 38 German webpages that were updated daily and that included a datestamp somewhere on the page. They measured freshness by looking at each search engine’s cached page to see how up-to-date the page was. If you measure success by having a version of a page within 0 or 1 days, Google succeeded a little under 83% of the time, MSN succeeded 48% of the time, and Yahoo succeeded about 42% of the time.

So I’m surprised to find the following quote in the recent, much-publicised New York Times article “Google Keeps Tweaking Its Search Engine”:

So [Mr. Singhal] monitors complaints on his white board, prioritizing them if they keep coming back. For much of the second half of last year, one of the recurring items was “freshness.”

Freshness, which describes how many recently created or changed pages are included in a search result, is at the center of a constant debate in search: Is it better to provide new information or to display pages that have stood the test of time and are more likely to be of higher quality? Until now, Google has preferred pages old enough to attract others to link to them.

But last year, Mr. Singhal started to worry that Google’s balance was off. When the company introduced its new stock quotation service, a search for “Google Finance” couldn’t find it. After monitoring similar problems, he assembled a team of three engineers to figure out what to do about them.

Earlier this spring, he brought his squad’s findings to Mr. Manber’s weekly gathering of top search-quality engineers who review major projects. At the meeting, a dozen people sat around a large table, another dozen sprawled on red couches, and two more beamed in from New York via video conference, their images projected on a large screen. Most were men, and many were tapping away on laptops. One of the New Yorkers munched on cake.

Mr. Singhal introduced the freshness problem, explaining that simply changing formulas to display more new pages results in lower-quality searches much of the time. He then unveiled his team’s solution: a mathematical model that tries to determine when users want new information and when they don’t. (And yes, like all Google initiatives, it had a name: QDF, for “query deserves freshness.”)

“Query deserves freshness”? I thought that all queries deserved freshness! A stale index results in poor quality SERPs that could lead to lots of pages that no longer exist or have been changed since they were indexed. A fresh index solves this problem to a great extent.

This appears to be a new, different use of the word freshness in relation to search engines, referring not to the last indexed date of a page, but to the last modified date of a page – something very different!
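
To make the two meanings concrete, here is a minimal Python sketch. The names `PageRecord`, `index_age` and `content_age`, the example URL and all of the dates are invented for illustration; nothing here reflects how any search engine actually stores or uses this data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class PageRecord:
    """Hypothetical record an index might hold for one page."""
    url: str
    last_indexed: datetime   # when the crawler last fetched and indexed the page
    last_modified: datetime  # when the page's content last changed


def index_age(page: PageRecord, now: datetime) -> timedelta:
    """Freshness as defined earlier in this post: time since the last crawl."""
    return now - page.last_indexed


def content_age(page: PageRecord, now: datetime) -> timedelta:
    """Freshness as the NYT article uses it: time since the content changed."""
    return now - page.last_modified


# Illustrative values only: an old, unchanging page that was crawled yesterday.
now = datetime(2007, 6, 15, tzinfo=timezone.utc)
page = PageRecord(
    url="http://example.com/spec",
    last_indexed=datetime(2007, 6, 14, tzinfo=timezone.utc),
    last_modified=datetime(2000, 1, 1, tzinfo=timezone.utc),
)

print("index age (days):", index_age(page, now).days)
print("content age (days):", content_age(page, now).days)
```

The same page is fresh by the first measure (crawled one day ago) and old by the second (unchanged for years), which is why the two uses of the word should not be conflated.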

Matt Cutts referred to this NYT article in “Five things you didn’t know about Google’s search”, without mentioning this variation in the use of the word freshness, which I find odd – especially as elsewhere in the article, it was claimed that Matt and Mr. Singhal share an office (with two other people). Hmmm … do they talk, do you think? :D

I think I’d have called it “Query Deserves Recency”.


2 Comments for QDF – WTF…?

Author comment by mike | June 16, 2007 at 4:33 am

> “Query deserves freshness”? I thought that all queries deserved freshness! A stale index results in poor quality SERPs that could lead to lots of pages that no longer exist or have been changed since they were indexed. A fresh index solves this problem to a great extent.

I disagree that all queries deserve the “Last crawled date” freshness, because it pits two goals (fresh index and effective utilisation of crawl resources) against each other.

Some content will not change much. Take the W3C HTML standards:
http://www.w3.org/TR/html401/

Given that some content doesn’t change, I would argue that it is prudent to not bother downloading such pages very often. In terms of bandwidth and useful utilisation of a parser (which, with ever increasing elements to uncover, is probably taking more CPU cycles than ever), the resources spent grabbing an unchanging page are better spent grabbing either new content for the index, or refreshing content that changes regularly, like the front page of Digg or news sites.

Reconciling an algorithm that optimises the utilisation of crawl resources, and so likely discriminates against primary sources that never change, with the use of “last crawled date” in rankings would be problematic, especially when the searcher clearly wants the primary source.

Sure, you could make the last crawl date for such pages constantly NOW, but that makes the ability to optimise crawling schedules (which would be as close to 1 second after publishing as possible :) ) potentially harmful to results. Optimising in two directions (crawl efficiency and relevance) is terribly difficult.

Maybe, rather than freshness, it is better thought of as “staleness” or “document accuracy”, i.e. the likelihood that the data stored for a document is accurate, with staleness being the moment at which the indexed copy can no longer be trusted as an accurate reflection of the document.

Freshness, per se, doesn’t really matter, but accuracy does. Such a measure would also allow different documents to lose trust in the accuracy of the indexed copy at different rates (a news site’s home page, for example, would likely be inaccurate after 10-15 minutes; a site with every legal judgement likely has a much longer decay cycle).
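
One way to sketch this “document accuracy” idea in Python, purely as an illustration: assume each document changes at random with a constant average rate (a Poisson process, which is an assumption made here and not something claimed above), so that the chance the indexed copy is still accurate decays exponentially with the time since the last crawl. The function names, the 0.5 trust threshold and the change intervals are all hypothetical.

```python
import math


def accuracy_probability(minutes_since_crawl: float,
                         mean_minutes_between_changes: float) -> float:
    """P(indexed copy still matches the live page), assuming Poisson changes."""
    return math.exp(-minutes_since_crawl / mean_minutes_between_changes)


def minutes_until_stale(mean_minutes_between_changes: float,
                        threshold: float = 0.5) -> float:
    """Time after which the accuracy probability falls to `threshold`."""
    return -mean_minutes_between_changes * math.log(threshold)


# Hypothetical change intervals for two very different kinds of document.
news_home = 15.0                 # changes roughly every 15 minutes
legal_archive = 60.0 * 24 * 365  # changes roughly once a year

for name, interval in [("news home page", news_home),
                       ("legal archive", legal_archive)]:
    print(f"{name}: stale after ~{minutes_until_stale(interval):.0f} minutes")
```

Under these assumptions each document gets its own decay rate, so a crawler could spend its resources on whichever indexed copies have the lowest accuracy probability rather than recrawling everything on the same schedule.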

Author comment by alan | July 13, 2007 at 12:38 pm

> Given that some content doesn’t change, I would argue that it is prudent to not bother downloading such pages very often

True, but that’s why the “If-Modified-Since” HTTP request header exists. The SERPs for any query should consist of fresh results, i.e. results that are likely to still exist and to be what the robot saw. That’s not to say that the results need to be recently created – just recently verified.
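
As a rough illustration of such a conditional request, here is a minimal Python sketch using only the standard library. The function names `build_conditional_request` and `verify_indexed_copy`, the return values and the injectable `opener` parameter are hypothetical choices for this example; it is not a description of how any real crawler works.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import format_datetime


def build_conditional_request(url: str, last_crawled: datetime) -> urllib.request.Request:
    """Build a GET request that asks for the page only if it changed since last_crawled."""
    request = urllib.request.Request(url)
    # HTTP dates are expressed in GMT, e.g. "Fri, 13 Jul 2007 12:38:00 GMT".
    http_date = format_datetime(last_crawled.astimezone(timezone.utc), usegmt=True)
    request.add_header("If-Modified-Since", http_date)
    return request


def verify_indexed_copy(url: str, last_crawled: datetime,
                        opener=urllib.request.urlopen) -> str:
    """Return 'unchanged', 'changed' or 'gone' for the page at url."""
    request = build_conditional_request(url, last_crawled)
    try:
        with opener(request) as response:
            response.read()  # a real crawler would re-parse and re-index this body
            return "changed"
    except urllib.error.HTTPError as error:
        if error.code == 304:         # Not Modified: the indexed copy is still accurate
            return "unchanged"
        if error.code in (404, 410):  # the page no longer exists
            return "gone"
        raise
```

A 304 response carries no body, so confirming that an unchanged page is still what the robot saw costs very little bandwidth. Not every server honours the header, in which case the page is simply downloaded in full.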

> Maybe, rather than freshness, it is better thought of as … “document accuracy”, i.e. the likelihood that the data stored for a document is accurate,

That’s exactly what’s at issue.
