The Silver Spike | The Official SilverDisc Blog


It appears that, some time ago, Google removed details of results prefetching from its Webmaster guidelines while continuing to implement results prefetching in its search results.

If you haven’t a clue what I’m talking about, the Wayback Machine has the original Google Webmaster help on this topic, which I’ll paste here verbatim in order to make it searchable (Wayback Machine pages aren’t indexed by search engines):

Results Prefetching Questions

1. What is “results prefetching,” and how does it impact my site?

On some searches, Google uses a special <link> tag supported by Firefox and Mozilla to instruct the browser to download the top search result before the user clicks on the result. When the user clicks on the top result, the destination page will load faster than before. This tag is only inserted when it is likely that the user will click on the first link.

For example, when a Firefox user searches for [stanford], Google includes the following tag in the results HTML:

<link rel="prefetch" href="http://www.stanford.edu/">

The official Mozilla Link Prefetching FAQ describes the behavior of this tag in detail.

Prefetching may impact your site because the prefetch request will happen whether or not the user clicks on the result, so it may result in additional traffic to your web server. Google only inserts this tag when there is a high likelihood that the user will click on the top result, but clearly this heuristic is not right 100% of the time.

2. Can I distinguish prefetch requests from normal requests?

Yes, as described in the Mozilla Link Prefetching FAQ, prefetch requests include the additional HTTP header

X-moz: prefetch

3. I want to block/ignore prefetch requests. What should I do?

To block or ignore prefetch requests (from Google and other web sites), you should configure your web server to return a 404 HTTP response code for requests that contain the “X-moz: prefetch” header.
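As a rough illustration of the advice in answer 3, here is a minimal Python sketch of a WSGI wrapper that returns a 404 for any request carrying the X-moz: prefetch header. The names block_prefetch, hello_app and status_for are made up for this example, and a real site would more likely apply the same rule in its web server configuration.

```python
def block_prefetch(app):
    """Wrap a WSGI app so that requests with 'X-moz: prefetch' get a 404."""
    def wrapper(environ, start_response):
        # WSGI exposes the "X-moz" request header as HTTP_X_MOZ.
        if environ.get("HTTP_X_MOZ", "").strip().lower() == "prefetch":
            body = b"Prefetch requests are not served.\n"
            start_response("404 Not Found", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return wrapper


def hello_app(environ, start_response):
    """Hypothetical stand-in for the real site."""
    body = b"Hello, searcher!\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


application = block_prefetch(hello_app)


def status_for(environ):
    """Run the wrapped app against a fake WSGI environ and return the status line."""
    captured = []
    application(environ, lambda status, headers: captured.append(status))
    return captured[0]
```

Calling status_for({"HTTP_X_MOZ": "prefetch"}) yields the 404 status line, while a request without the header passes through to the wrapped application untouched.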

What else do you need to know about results prefetching?

If you run Google Analytics or another JavaScript-based analytics package, you won’t see these prefetched pages in your analytics. That’s because only the HTML is prefetched, not the images, JavaScript, etc. referenced by that HTML, which means that the Analytics JavaScript is never even fetched, let alone executed. You need to look at raw log files to see prefetched pages.

Google only issues the prefetch code when they are very confident that searchers will click on the #1 result (as in their example, a search for stanford). Most times, particularly for more “normal” sites (i.e. not Stanford), Google won’t issue the code. So you may never see this on your own site.

However, it’s worth being aware of this issue because if you do see a prefetch in your raw logs you’ll want to know why; and because, depending on how you calculate conversions, the fact that a page is prefetched but never viewed by a searcher may significantly affect your conversion tracking and monetisation on that page. I’m surprised that Google removed this info from their Webmaster help.

No tags

In my last post I looked at the rel=canonical tag and finished by promising to look at some of the limitations of rel=canonical and consider some alternatives.

Many of the alternatives have existed for some time – the use of redirects and cookies, for example. However, the introduction of a rel=canonical tag was an opportunity for search engines to also introduce other, more efficient, standards. These are the alternatives I would like to consider – alternatives that don’t exist yet, which the search engines could have introduced this time around and may introduce in future.

I see the rel=canonical tag as analogous to the meta robots tag, and therefore suffering from many of the same limitations:

  • The rel=canonical tag is located in an HTML file, and that HTML therefore needs to be fetched and parsed in order for the tag to be seen and acted upon. Therefore, the tag does not save bandwidth or CPU for the Web site or search engine.
  • The rel=canonical tag is located in an HTML file and gives instructions about that file. Therefore, it cannot be used to solve canonical issues for non-HTML files such as images, PDF files or Flash movies.
  • The rel=canonical tag acts at a micro-level rather than a macro-level. Therefore it is difficult to verify that a site-wide policy has been correctly implemented using rel=canonical; every possible file has to be inspected. Also, code changes have to be made in order to write the rel=canonical tag. This may slow its implementation.

Given that the above issues apply to rel=canonical, and similar issues apply to the meta robots tag, it struck me that an opportunity has been missed to also solve canonical issues through the robots.txt file. Any fix applied through robots.txt would not suffer from the above problems.

Extensions to robots.txt could be made in a number of ways. For example, a mod_rewrite-type syntax could be introduced. However, I’m not sure anything so advanced is needed. Most canonical issues arise from three things:

  1. the use of query parameters in dynamic URLs.
  2. www versus non-www versions of a site (and other subdomains).
  3. inconsistent use of default index page URLs.

Some simple robots.txt fields to control these issues would fix most problems without the pain and errors that a mod_rewrite implementation would create.

Query Parameters

Google Analytics and Yahoo Site Explorer are two examples of tools that allow simple manipulation of URL query parameters. Yahoo’s Dynamic URL Help lists some of the crawling, indexing and ranking benefits of this approach.

Yahoo Site Explorer allows you to remove a query parameter or set a query parameter to a default value within a URL. Using this, a URL such as

  • http://www.example.com/page.php?refby=affiliate&sid=abc123

could be crawled and indexed as

  • http://www.example.com/page.php?refby=yhoo_srch

The session id has been dropped and the referrer has been overwritten as yhoo_srch, meaning all traffic sent by Yahoo Search could be attributed to Yahoo Search rather than the affiliate. This functionality could be implemented in robots.txt using a new syntax something like the following:

User-Agent: Slurp
Disallow:
QueryParam: -sid
QueryParam: refby=yhoo_srch

meaning that the sid query parameter is to be dropped (as it is preceded by ‘-’) and the refby query parameter is to be overwritten with a default value (as a default value is provided). The same effect could be achieved with a single line:

User-Agent: Slurp
Disallow:
QueryParam: -sid, refby=yhoo_srch

One problem with both Google Analytics and Yahoo Site Explorer is that you must list the query parameters you wish to drop from URLs – not the ones you wish to keep. Because third parties can link to your site, you’re not in control of the links they create and the query parameters they use. Therefore, canonical issues can only truly be solved by specifying the query parameters you wish to keep, rather than those you wish to drop. To solve this, wildcards could specify the default action to be applied to all non-listed query parameters. Therefore I propose the following syntax:


QueryParam: retainParam[=defaultValue]
QueryParam: -dropParam
QueryParam: [-]*

where…

  • retainParam[=defaultValue]: specifies a query parameter you definitely want to keep, and an optional default value you want it set to
  • -dropParam: specifies a query parameter you definitely want to drop
  • *: means keep all query parameters not specified (default)
  • -*: means drop all query parameters not specified
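To make the proposal concrete, here is a minimal Python sketch of how a crawler might apply QueryParam rules to a URL. QueryParam is only the syntax proposed in this post, not something any search engine supports, and the function names parse_rules and apply_query_params are invented for the illustration. The sketch also assumes that a default value overwrites a parameter only when that parameter is already present in the URL, and that a drop rule wins if a parameter is listed both ways.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def parse_rules(lines):
    """Turn QueryParam values such as '-sid, refby=yhoo_srch' into lookup tables."""
    drop, retain, keep_unlisted = set(), {}, True  # '*' (keep) is the default
    for line in lines:
        for rule in (part.strip() for part in line.split(",")):
            if not rule:
                continue
            if rule == "*":
                keep_unlisted = True
            elif rule == "-*":
                keep_unlisted = False
            elif rule.startswith("-"):
                drop.add(rule[1:])
            else:
                name, sep, default = rule.partition("=")
                # None means "keep whatever value the URL already has".
                retain[name] = default if sep else None
    return drop, retain, keep_unlisted


def apply_query_params(url, lines):
    """Rewrite the query string of url according to the QueryParam rules."""
    drop, retain, keep_unlisted = parse_rules(lines)
    parts = urlsplit(url)
    kept = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name in drop:
            continue
        if name in retain:
            default = retain[name]
            kept.append((name, value if default is None else default))
        elif keep_unlisted:
            kept.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

With the rules "-sid" and "refby=yhoo_srch", the earlier example URL http://www.example.com/page.php?refby=affiliate&sid=abc123 comes out as http://www.example.com/page.php?refby=yhoo_srch, whether the rules are given on two lines or on one comma-separated line.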

Default domain and Index Pages

Two further, much simpler additions to robots.txt could clear up the majority of other canonical problems. These are Domain and IndexPage:


Domain: defaultDomain
IndexPage: defaultIndexPage

defaultDomain specifies the default domain for this robots.txt file. For example, if the search engine retrieves http://www.example.com/robots.txt and finds …


Domain: http://example.com/

…it would know to index all URLs under the non-www domain. This would allow multiple parked domains to share the same content and robots.txt file without needing redirects or causing canonical issues, which is currently a common problem.

The IndexPage field specifies a default index page for the domain, i.e. a page for which the following two URLs are considered equivalent:

http://www.example.com/path/

http://www.example.com/path/defaultIndexPage
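The effect of these two fields can be sketched in a few lines of Python. Domain and IndexPage are only the fields proposed here, and apply_domain_and_index is a made-up name. The sketch also assumes that the directory form (ending in "/") is the one to index; the proposal itself says only that the two URLs are equivalent.

```python
from urllib.parse import urlsplit, urlunsplit


def apply_domain_and_index(url, domain=None, index_page=None):
    """Rewrite url onto the default domain and strip the default index page."""
    parts = urlsplit(url)
    scheme, netloc, path = parts.scheme, parts.netloc, parts.path
    if domain:
        # The Domain field is written as a full URL, e.g. "http://example.com/".
        default = urlsplit(domain)
        scheme, netloc = default.scheme, default.netloc
    if index_page and path.endswith("/" + index_page):
        # Assumption: prefer the directory form over the explicit index page.
        path = path[: -len(index_page)]
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))
```

For example, with Domain set to http://example.com/ and IndexPage set to index.html, the URL http://www.example.com/path/index.html would be treated as http://example.com/path/.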

Conclusion

In this post I’ve proposed three new fields to add to robots.txt to provide an alternative to the rel=canonical tag, just as the current robots.txt fields are themselves alternatives to the meta robots tag, with their own advantages and disadvantages. The chief advantages I see of canonicalising through robots.txt are:

  • Acting through robots.txt means that a resource does not have to be fetched and parsed in order for the canonicalisation instructions to be followed. Therefore, bandwidth and CPU are saved for both the Web site and search engine.
  • Acting through robots.txt means that canonical issues can be solved for non-HTML files such as images, PDF files or Flash movies.
  • Acting through robots.txt means large scale changes can be made very quickly and easily without the need for any code changes. It’s also much easier to review the changes that have been made.

The Domain, IndexPage and QueryParam fields would all be optional and independent of each other. It would be great if the search engines could introduce some or all of these ideas into robots.txt.

No tags

I’ve been meaning to write about the new rel=canonical tag, which was proposed by Google, Yahoo and Microsoft on February 12. I managed to squeeze some thoughts on it into my presentation and workshop at SES London, and I’ll be speaking more about it at SES New York next month, but before I blogged about it I really wanted to write more about URL Canonicalisation and Normalisation in general.

Canonicalisation or Canonicalization?
Normalisation or Normalization?

I’m British, so I say Canonicalisation and Normalisation. Your mileage may vary.

What is URL Canonicalisation?

We’re talking about search engines here, so let’s try a definition that applies generally, but leans towards search:

URL Canonicalisation
involves taking a set of different URLs that all serve or lead to the same or similar content, and applying rules to select one URL from that set under which that content should be indexed or presented.

I’ve hyperlinked the terms I think are important to more detail below, but before we go into them let’s try defining URL Normalisation.

URL Normalisation
involves taking a single URL and applying a normalisation algorithm to produce a standard form for that URL.

Others define normalisation and canonicalisation as all part of the same thing, but I like to think of them as separate processes. To my way of thinking:

  • you can normalise a single URL but you can only canonicalise a set of URLs
  • an un-normalised URL will serve the same content as a normalised URL, because it’s the same URL
  • all indexed URLs are normalised; not all are canonicalised
  • normalisation occurs before canonicalisation

Now let’s go back and look at those hyperlinked terms in more detail.

Set of different URLs

This is the key to canonicalisation and why it’s needed: the same content is being presented at a number of different URLs. By different URLs, I mean those URLs are really different to each other – they could potentially show different content but (in this case) they don’t.

Here is an example set of URLs:

  • http://www.example.com/
  • http://example.com/
  • http://www.example.com/index.html
  • http://example.com/default.asp
  • http://www.example.com/?referrer=affiliateName
  • http://www.example.com/?sessionid=123456

All serve or lead to the same or similar content

If each of the above URLs served the same, or essentially the same, content, it’s likely that they would be canonicalised to fewer URLs – possibly only one. If they each served completely different content, then it’s much less likely that this canonicalisation would take place. By “or lead to”, I mean that the URL may redirect (e.g. with an HTTP 301 or HTTP 302 redirect) to another URL.

Canonicalisation Rules

The rules for canonicalisation vary from engine to engine and time to time. Here are a few examples of when canonicalisation will take place …

  • If www and non-www versions of the URL exist, then canonicalise
  • If the same base URL is seen with different numbers of query parameters, then canonicalise
  • If the filename component of the URL matches a known set of index pages (e.g. index.*, default.*, etc.) then canonicalise
  • If the home page (“/”) redirects to another page, then canonicalise

… and here are some examples of how canonicalisation will take place:

  • Choose the URL with the highest Pagerank (or similar link-based or other off-page criteria)
  • Obey rel=canonical webmaster hint
  • Choose the simplest URL (e.g. the shortest URL, or the one with fewest query parameters)
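As an illustration of the last rule only, here is a small Python sketch that picks the "simplest" URL from a set. Which measure of simplicity comes first is an assumption made for this example (fewest query parameters, then shortest), simplest_url is a made-up name, and a real engine would weigh this against link-based and other signals.

```python
from urllib.parse import urlsplit, parse_qsl


def simplest_url(urls):
    """Pick the URL with the fewest query parameters, breaking ties by length."""
    def simplicity(url):
        params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
        # The final tuple element makes the choice deterministic for exact ties.
        return (len(params), len(url), url)
    return min(urls, key=simplicity)


example_set = [
    "http://www.example.com/",
    "http://example.com/",
    "http://www.example.com/index.html",
    "http://example.com/default.asp",
    "http://www.example.com/?referrer=affiliateName",
    "http://www.example.com/?sessionid=123456",
]
```

Applied to the example set of URLs given earlier, simplest_url(example_set) selects http://example.com/.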

Indexed or presented

Sometimes only one URL from a set will be indexed, which means that it will always be the candidate URL to be presented in a set of search results.

At other times multiple URLs may be indexed, even though they are known to be part of the same canonical set. One of these URLs will be selected to appear in a given set of search results. The URL that is selected may vary (for example, by query or by searcher location) – but only one will ever appear on a given search results page.

Single URL

Normalisation operates on a single URL rather than on a set of URLs. That single URL may need to be supplemented with other data in order for normalisation to take place. For example, un-normalised URLs may be relative or absolute. A normalised URL will always be a fully-qualified absolute URL so, along with a relative URL, the containing URL or tag will need to be known in order for normalisation to take place.

Normalisation algorithm to produce a standard form

Like canonicalisation rules, the normalisation algorithm may vary from engine to engine and time to time. However, it’s much less likely to vary. Here are the kinds of things that are done during normalisation:

  1. convert a relative URL to an absolute URL
  2. convert the scheme and the host name components of the URL to lower case
  3. remove the port component if it matches the default port
  4. escape characters that should be represented as octets (or a +)
  5. unescape octets that are better represented as plain characters
  6. convert all escape sequences to upper case

Here are some examples of each operation:

  1. In http://www.silverdisc.co.uk/ , a link to “/contact.html” would be normalised to http://www.silverdisc.co.uk/contact.html
  2. HTTP://WWW.SILVERDISC.CO.UK/contact.html would be normalised to http://www.silverdisc.co.uk/contact.html
  3. http://www.silverdisc.co.uk:80/contact.html would be normalised to http://www.silverdisc.co.uk/contact.html, because 80 is the default port for HTTP connections.
  4. http://www.silverdisc.co.uk/contact.html?name=Alan Perkins would be normalised to http://www.silverdisc.co.uk/contact.html?name=Alan+Perkins or http://www.silverdisc.co.uk/contact.html?name=Alan%20Perkins, because a space is not a valid character in a URL.
  5. http://www.silverdisc.co.uk/cont%61ct.html would be normalised to http://www.silverdisc.co.uk/contact.html, because %61 is better represented as the character “a” in a URL.
  6. A %2a in a URL would be converted to %2A for consistency
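The six operations above can be sketched in Python using the standard urllib.parse module. This is a simplified illustration rather than any engine's actual algorithm: the function names normalise and _fix_escapes are made up, spaces are always escaped as %20 rather than +, only the RFC 3986 "unreserved" characters are unescaped, and user names, passwords and IPv6 hosts are not handled.

```python
import re
from urllib.parse import quote, urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
# Characters that never need escaping (the RFC 3986 "unreserved" set).
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def _fix_escapes(text):
    """Apply operations 4 to 6 to a path or query string."""
    # 4. Escape characters such as spaces, leaving existing %XX escapes alone.
    text = quote(text, safe="/:@!$&'()*+,;=?%")

    def tidy(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                       # 5. unescape, e.g. %61 becomes a
        return "%" + match.group(1).upper()   # 6. upper-case, e.g. %2a becomes %2A

    return re.sub(r"%([0-9A-Fa-f]{2})", tidy, text)


def normalise(url, base=None):
    """Return a normalised absolute form of url (relative to base, if given)."""
    if base is not None:
        url = urljoin(base, url)              # 1. relative to absolute
    parts = urlsplit(url)
    scheme = parts.scheme.lower()             # 2. lower-case the scheme...
    netloc = (parts.hostname or "").lower()   #    ...and the host name
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{netloc}:{parts.port}"     # 3. keep only non-default ports
    return urlunsplit((
        scheme,
        netloc,
        _fix_escapes(parts.path),
        _fix_escapes(parts.query),
        parts.fragment,
    ))
```

For instance, normalise("/contact.html", base="http://www.silverdisc.co.uk/") and normalise("HTTP://WWW.SILVERDISC.CO.UK:80/cont%61ct.html") both produce http://www.silverdisc.co.uk/contact.html.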

Summary

That completes this introduction to URL canonicalisation and normalisation. In the next post, I’ll look at rel=canonical.

No tags

Mar/08

10

SilverDisc Celebrates 15th Birthday

SilverDisc is 15 years old today!

SilverDisc was established on March 10, 1993 by three people (Alan Perkins, Allan Todd and Eric Barfield) who had met while working on interactive CD applications at Philips in Surrey. That is why the company is called SilverDisc – because its first products and services were delivered on CD, a “silver disc”.

Alan, Allan and Eric were three guys who enjoyed making a living while having fun doing techie things. Well, they considered it fun anyway! In 1993 they were using very low speed dial-up modems and CompuServe to communicate with each other and a small world of CD developers. In 1994 they got their first Web server and started hosting Web sites for clients. In 1995 they developed a fully functional shop with online credit card capabilities for one client, and started hosting the Web site for HarperCollins, a major publisher – not bad going for three guys working from home and having fun.

Early Years of Search Marketing

In late 1995, on the day AltaVista launched, SilverDisc realised the potential that search engines held for marketing purposes. They started marketing themselves and their clients through search engine optimization, although that phrase was not in use at the time.

During the mid-to-late nineties SilverDisc continued to deliver CD and DVD products and services as well as Internet services and Internet marketing. It remained three guys having fun. Then, a few things happened in quick succession:

  • Allan moved back to Scotland and got a real job working for somebody else – he wanted his young family to get a “proper Scottish education”
  • Alan moved back to Northamptonshire, mainly to tap into a support network for his young family, but remained with SilverDisc
  • Alan and Eric teamed up with a distant relative of Eric’s and formed a new company, e-Brand Management, to take advantage of the “dot com boom”

1998 to 2000 was spent developing patents, products and services based around some ideas that Alan had in the years since 1995. In parallel, SilverDisc continued to service its existing client base.

The patents were filed in 1999 and have since been granted. They cover some very fundamental search engine ground. One patent is in crawling and indexing, and the other is in personalisation – both are hot topics today, nine years later. The first product, Search Mechanics, was launched at the very first Search Engine Strategies to be held in the UK and e-Brand Management was one of only five exhibitors there.

That covers “SilverDisc – the early years”. In a future post, I’ll look at what has happened since 2000. :)

No tags

Charity Christmas cards are commendable, but still a large part of the cost of buying and posting charity cards does not end up with the charity itself.

So this year, SilverDisc has chosen to donate its entire Christmas card budget of £250 to local charity Zach’s Helping Hand, and instead send an electronic Christmas card.

Zach’s Helping Hand helps families whose children are near the end of life to receive palliative care in the love of their own homes. It is dedicated to the memory of Zach Sanders, who died of a brain tumour aged just two, and was set up by Zach’s parents Andy and Claire.

The photo shows me, Andy, Bella (Zach’s sister) and Lynda Litchfield of SilverDisc.

Alan Perkins and Lynda Litchfield of SilverDisc presenting a cheque for £250 to Andy and Bella Sanders of Zach's Helping Hand

No tags

Apr/07

26

Welcome to The Silver Spike

Hello, world! Welcome to the Silver Spike, the “Official SilverDisc Blog”.

If you don’t know who or what SilverDisc is, then check out www.silverdisc.co.uk, our main Web site.

I’ve always been slightly sceptical about a SilverDisc blog. I was finally persuaded it might be a good idea while reading “Naked Conversations” (by Robert Scoble and Shel Israel) on holiday recently. While there is much in the book that I disagree with (Scoble and Israel admit to being biased and evangelical), in the end I was forced to agree that they had a point – reading blogs without writing a blog is like owning a telephone where only the receiver works. And the name of this blog, “The Silver Spike”, comes from the concluding chapter of Naked Conversations:

But if blogging is truly part of a revolution, will it be bloodless? We see a clear and present danger for practitioners of traditional, one-direction advertising, marketing. We see its champions in a change or die situation. Blogging and the social media are steadily pounding a silver spike into the heart of it.

So, it’s not called The Silver Spike because it will be the spiteful, vehement outpourings of SilverDisc – we’re not like that :) . It’s called The Silver Spike because it’s got the word “silver” in it, which was important to us, and because we like the quote – although to me it seems that Scoble and Israel have mixed up their vampire and werewolf metaphors.

While I’m on the topic of Naked Conversations, I’ll cover off the main things I disagreed with in the book. The first was the Six Pillars of Blogging, the fact that blogs are:

  1. Publishable
  2. Findable
  3. Social
  4. Viral
  5. Syndicatable
  6. Linkable

The problem I have with that list is that many other Web sites, or types of Web site, meet most if not all of those criteria. For example, forums are very similar to blogs – better in some ways since they provide many-to-many communication rather than one-to-many. I really needed some convincing of the difference between a blog and a Web site, and the authors failed to deliver it with these six pillars. The main conclusion I came to is that blogging is particularly powerful in the “Syndicatable” sense, in that it makes syndication very easy, and this in combination with the other pillars was arguably blogging’s unique strength. The other conclusion I came to is that unless I gave it a try, I might never truly understand it.

Another thing that really annoyed me in Naked Conversations was the constant references to “Google Juice”, i.e. the power of blogs to influence your rankings in search engines, particularly Google. A couple of examples:

Every time you post, Google notices the update and that boosts your ratings. Google also pays attention to links—other sites that connect to you. Bloggers who find what you write interesting, will post on their own sites and link back to you. Those links also boost your “Google juice.” In fact, nothing will boost your search engine standing better.

I told him that because he didn’t have a real blog, he had no Google juice.

I wonder if the authors cringe that they actually published that. IMO, far too much emphasis was placed throughout the book, naively, on the influence that blogs have on search engines.

A third thing that troubled me was the implication that because people are switching off to advertisements and push marketing, blogs were a good way of marketing to those people in a less obvious way. I was left with the impression that blogs were a good means of advertising to people without letting them know you were advertising. That’s very dangerous ground, but the authors seemed to think it was a Good Thing.

Anyway, despite all of that I thought the book was a good read and I’d recommend it to you. And it forced me into doing what I’ve been thinking I ought to do for a long time, but never quite got around to – starting a SilverDisc blog. Take the second star to the right and straight on ’til morning. ;)

No tags
