The Silver Spike | The Official SilverDisc Blog


Site Architecture

Matt Cutts has given a very useful interview with Eric Enge, which rounds up a lot of information architecture and technical architecture issues.

There’s nothing really new here, but it’s good to get all this info into one place and to see it confirmed by Matt.

Topics covered:

  • crawl budget/indexation cap – the use of Pagerank and host load to control crawl depth and frequency
  • the effect of duplicate content on Pagerank
  • session IDs and affiliate IDs in links/URLs
  • faceted navigation – good to see Matt confirming that the advice I gave at SES London, and will be giving next week at SMX Munich, is all correct.
  • Different ideas for use of the rel=canonical tag
  • 301 redirects and how they differ from 302 redirects
  • Google Webmaster Tools (WMT) ignore parameters
  • Pagerank Sculpting and its effectiveness in the modern world
  • Javascript, IFRAME and PDF handling
  • Paid links and nofollow

Overall, the article strongly reinforces the fact that a successful site architecture is essential to SEO success.

No tags

Google’s John Mueller has published a good article on working with multi-regional web sites. He confirms that country-code Top Level Domains (ccTLDs) are the best way to host multi-regional content. He also clears up some of the myths surrounding duplicate content on multi-regional domains, which is most welcome.

John doesn’t mention that the same thinking applies even if you are targeting a single country. A ccTLD is the best way to indicate the location of your target market to search engines – and to that market itself, of course.

A URL gives you at least five places to target a country: domain (ccTLD), subdomain (de.domain.com), directory (www.domain.com/de/), path parameters (www.domain.com/;domain=de) and query parameters (www.domain.com/?domain=de). However, there are lots more axes for the content to be split along:

  • Category – Web, Enterprise, Social, Real Time
  • Context – Intranet, Library, Personal
  • Topic – Health, Travel, Jobs, etc.
  • Vertical – Finance, Education, Government, etc.
  • Platform – Desktop, Mobile, Television, Kiosk
  • Format – Text, Image, Audio, Video, Map

(Note: the above is slightly modified from a table provided by Search Patterns, an excellent read)

Given this number of ways of organising content, and the fact that the location and language of your target audience are major considerations (worthy of a major axis), in all but the most trivial cases a ccTLD is the obvious choice for geo-targeting. It’s good to see official written confirmation of this from Google.
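To make the five URL positions listed earlier in this post concrete, here is a minimal Python sketch that reports where a country code appears in a URL. The function name country_positions is hypothetical, and the parameter name "domain" is simply taken from the example URLs above; real sites will use their own conventions.

```python
from urllib.parse import urlparse

def country_positions(url, code):
    """Return the URL positions in which the country code `code` appears."""
    if "://" not in url:
        url = "http://" + url  # the example URLs in the text omit the scheme
    parts = urlparse(url)
    labels = (parts.hostname or "").split(".")
    segments = [s for s in parts.path.split("/") if s]
    # Path parameters follow ";" in the last path segment (urlparse exposes them as .params)
    path_params = dict(p.split("=", 1) for p in parts.params.split(";") if "=" in p)
    query_params = dict(p.split("=", 1) for p in parts.query.split("&") if "=" in p)

    found = []
    if labels[-1] == code:
        found.append("ccTLD")
    if len(labels) > 2 and labels[0] == code:
        found.append("subdomain")
    if segments and segments[0] == code:
        found.append("directory")
    if path_params.get("domain") == code:  # "domain" is the example's parameter name
        found.append("path parameter")
    if query_params.get("domain") == code:
        found.append("query parameter")
    return found

for example in ["www.domain.de", "de.domain.com", "www.domain.com/de/",
                "www.domain.com/;domain=de", "www.domain.com/?domain=de"]:
    print(example, country_positions(example, "de"))
```

Only the ccTLD check looks at the registered domain itself; the other four positions are all free for other axes such as topic or platform, which is the point made above.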

No tags

Mar/10

9

Calling for link spam reports

I see that Matt Cutts of Google is calling for link spam reports.

I’m still very troubled by this paid links issue after all these years!

I agree it’s Google’s right to penalise or promote any page/site in its natural listings, which represent Google’s subjective opinion of relevancy.

However, the idea that all paid links are bad/“evil” is wrong in so many ways:

  • Paid links pre-date Google.
  • There is no machine-readable standard for labelling a paid link. I’ll repeat that – there is no machine-readable standard for labelling a paid link.
  • Paid links can pass the “Does this make sense in the absence of search engines?” ethical test: for many paid links the answer may well be “Yes”. (Where the answer is “No”, I agree paid links are spam.)
  • Labelling paid links fails the “Would I do this if search engines did not exist?” test. In fact, you have to know that Google exists, and that they mind about paid links, in order to label those paid links in the non-standard way that Google asks you to label them. This is perhaps my biggest beef with Google’s approach to paid links – they actually violate one of Google’s published Webmaster principles.
  • What does “paid” mean anyway? An actual exchange of cash? If you look at the top results for any hugely commercial field, say “car insurance”, it’s hard to believe that there is no commercial influence in the results! When all that a company does is commercial, then every link (positive or negative) to that company’s site is commercial in nature.

I understand that a market in paid links arose because of Google’s algorithm.

However, the irony is that in responding to that market by asking all publishers to label paid links in a non-standard way, Google violated its own principles. It started to ask publishers to adapt what they published to suit Google (because Google existed), and called them spammers if they didn’t. That’s the wrong way around. It’s the spammers that do stuff purely because Google exists!

No tags

SilverDisc offers an early Easter Egg to Silver Spike readers – a rel=canonical calculator to help you help search engines to deliver more high quality, high converting visitors to your site.

This builds on the recent series of posts on this topic:

The rel=canonical calculator will go on general release in the next couple of weeks, and we will be making some PHP code available to insert the rel=canonical tag on your own pages. That’s right – FREE CODE. Register using the instructions provided on the rel=canonical calculator page.

No tags

In my last post I looked at the rel=canonical tag and finished by promising to look at some of the limitations of rel=canonical and consider some alternatives.

Many of the alternatives have existed for some time – the use of redirects and cookies, for example. However, the introduction of a rel=canonical tag was an opportunity for search engines to also introduce other, more efficient, standards. These are the alternatives I would like to consider – alternatives that don’t exist yet, which the search engines could have introduced this time around and may introduce in future.

I see the rel=canonical tag as analogous to the meta robots tag, and therefore suffering from many of the same limitations:

  • The rel=canonical tag is located in a HTML file, and that HTML therefore needs to be fetched and parsed in order for the tag to be seen and acted upon. Therefore, the tag does not save bandwidth or CPU for the Web site or search engine.
  • The rel=canonical tag is located in a HTML file and gives instructions about that file. Therefore, it cannot be used to solve canonical issues for non-HTML files such as images, PDF files or Flash movies.
  • The rel=canonical tag acts at a micro-level rather than a macro-level. Therefore it is difficult to verify that a site-wide policy has been correctly implemented using rel=canonical; every possible file has to be inspected. Also, code changes have to be made in order to write the rel=canonical tag. This may slow its implementation.

Where the above issues apply to rel=canonical, and similar issues apply to the meta robots tag, it struck me that an opportunity has been missed to also solve canonical issues through the robots.txt file. Any fix applied through robots.txt would not suffer from the above problems.

Extensions to robots.txt could be made in a number of ways. For example, a mod_rewrite-type syntax could be introduced. However, I’m not sure anything so advanced is needed. Most canonical issues arise from three things:

  1. the use of query parameters in dynamic URLs.
  2. www versus non-www versions of a site (and other subdomains).
  3. inconsistent use of default index page URLs.

Some simple robots.txt fields to control these issues would fix most problems without the pain and errors that a mod_rewrite implementation would create.

Query Parameters

Google Analytics and Yahoo Site Explorer are two examples of tools that allow simple manipulation of URL query parameters. Yahoo’s Dynamic URL Help lists some of the crawling, indexing and ranking benefits of this approach.

Yahoo Site Explorer allows you to remove a query parameter or set a query parameter to a default value within a URL. Using this, a URL such as

  • http://www.example.com/page.php?refby=affiliate&sid=abc123

could be crawled and indexed as

  • http://www.example.com/page.php?refby=yhoo_srch

The session id has been dropped and the referrer has been overwritten as yhoo_srch, meaning all traffic sent by Yahoo Search could be attributed to Yahoo Search rather than the affiliate. This functionality could be implemented in robots.txt using a new syntax something like the following:

User-Agent: Slurp
Disallow:
QueryParam: -sid
QueryParam: refby=yhoo_srch

meaning that the sid query parameter is to be dropped (as it is preceded by ‘-’) and the refby query parameter is to be overwritten with a default value (as a default value is provided). The same effect could be achieved with a single line:

User-Agent: Slurp
Disallow:
QueryParam: -sid, refby=yhoo_srch

One problem with both Google Analytics and Yahoo Site Explorer is that you must list the query parameters you wish to drop from URLs – not the ones you wish to keep. Because third parties can link to your site, you’re not in control of the links they create and the query parameters they use. Therefore, canonical issues can only truly be solved by specifying the query parameters you wish to keep, rather than those you wish to drop. To solve this, wildcards could specify the default action to be applied to all non-listed query parameters. Therefore I propose the following syntax:


QueryParam: retainParam[=defaultValue]
QueryParam: -dropParam
QueryParam: [-]*

where…

  • retainParam[=defaultValue]: specifies a query parameter you definitely want to keep, and an optional default value you want it set to
  • -dropParam: specifies a query parameter you definitely want to drop
  • *: means keep all query parameters not specified (default)
  • -*: means drop all query parameters not specified
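The proposed QueryParam rules can be sketched in a few lines of Python. This is only an illustration of the proposal, not something any search engine implements; the function name apply_query_params is hypothetical, and I have assumed that a default value overwrites a parameter that is present in the URL rather than adding one that is absent.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def apply_query_params(url, rule_lines):
    """Apply hypothetical QueryParam rules (e.g. ["-sid", "refby=yhoo_srch"]) to a URL."""
    drop, keep, defaults = set(), set(), {}
    keep_unlisted = True  # "*" (keep everything not listed) is the default
    for line in rule_lines:
        for rule in line.split(","):  # allow the single-line, comma-separated form
            rule = rule.strip()
            if rule == "*":
                keep_unlisted = True
            elif rule == "-*":
                keep_unlisted = False
            elif rule.startswith("-"):
                drop.add(rule[1:])
            elif "=" in rule:
                name, value = rule.split("=", 1)
                defaults[name] = value
            elif rule:
                keep.add(rule)

    parts = urlsplit(url)
    result = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name in drop:
            continue
        if name in defaults:
            result.append((name, defaults[name]))  # overwrite with the default value
        elif name in keep or keep_unlisted:
            result.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(result)))

url = "http://www.example.com/page.php?refby=affiliate&sid=abc123"
print(apply_query_params(url, ["-sid", "refby=yhoo_srch"]))
print(apply_query_params(url, ["refby", "-*"]))
```

The second call shows the "keep list" style argued for above: only refby survives, whatever other parameters a third party attaches to the link.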

Default domain and Index Pages

Two further, much simpler additions to robots.txt could clear up the majority of other canonical problems. These are Domain and IndexPage:


Domain: defaultDomain
IndexPage: defaultIndexPage

defaultDomain specifies the default domain for this robots.txt file. For example, if the search engine retrieves http://www.example.com/robots.txt and finds …


Domain: http://example.com/

…it would know to index all URLs under the non-www domain. This would allow multiple parked domains to share the same content and robots.txt file without needing redirects or causing canonical issues, which is currently a common problem.

The IndexPage field specifies a default index page for the domain, i.e. a page for which the following two URLs are considered equivalent:

http://www.example.com/path/

http://www.example.com/path/defaultIndexPage
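As a rough sketch of how a crawler might apply the proposed Domain and IndexPage fields, consider the following Python. The function name is hypothetical, and choosing the directory form ("/path/") as the preferred one of the two equivalent URLs is my assumption; the proposal only says the two are equivalent.

```python
from urllib.parse import urlsplit, urlunsplit

def apply_domain_and_index(url, default_domain=None, index_page=None):
    """Rewrite a URL using the hypothetical Domain and IndexPage robots.txt fields."""
    parts = urlsplit(url)
    scheme, netloc, path = parts.scheme, parts.netloc, parts.path
    if default_domain:
        domain = urlsplit(default_domain)
        scheme, netloc = domain.scheme, domain.netloc  # index under the default domain
    if index_page and path.endswith("/" + index_page):
        path = path[: -len(index_page)]  # collapse /path/index.html to /path/
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))

print(apply_domain_and_index("http://www.example.com/path/index.html",
                             default_domain="http://example.com/",
                             index_page="index.html"))
```

Both arguments are optional, mirroring the idea that the fields would be independent of each other.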

Conclusion

In this post I’ve proposed three new fields to add to robots.txt to provide an alternative to the rel=canonical tag, just as the current robots.txt fields are themselves alternatives to the meta robots tag, with their own advantages and disadvantages. The chief advantages I see of canonicalising through robots.txt are:

  • Acting through robots.txt means that a resource does not have to be fetched and parsed in order for the canonicalisation instructions to be followed. Therefore, bandwidth and CPU are saved for both the Web site and search engine.
  • Acting through robots.txt means that canonical issues can be solved for non-HTML files such as images, PDF files or Flash movies.
  • Acting through robots.txt means large scale changes can be made very quickly and easily without the need for any code changes. It’s also much easier to review the changes that have been made.

The Domain, IndexPage and QueryParam fields would all be optional and independent of each other. It would be great if the search engines could introduce some or all of these ideas into robots.txt.

No tags

So, Google, Yahoo, Microsoft and, more recently, Ask have announced the new “canonical” link type or, more colloquially, the rel=canonical tag.

Much has already been written about this tag and its purpose: to help prevent duplicate content issues. Probably the best summary is this Matt Cutts video:

This tag is a welcome addition to the armoury in the fight against duplicate content issues. In addition to Matt’s comments, I would make the following points:

Copyright Protection

Scrapers are forever copying content and publishing it on their own sites/splogs. Sometimes they are exceptionally lazy or stupid, even to the extent that they copy Adsense code onto their own sites. If they copy your rel=canonical tag onto their site, that would give a strong “hint” to the search engine that you were the original owner of the content:

<link rel="canonical" href="http://www.mysite.com/my/content/" />

Microsoft Platforms

Matt made reference to the Microsoft platform in his video, but I would emphasise the point. Microsoft’s implementation of RFC 2396 is flawed. The path component of a URL is supposed to be case sensitive, but Microsoft makes it case insensitive. If there are n alphabetic characters in the path, then a Microsoft implementation gives 2^n possible variations of that path, where there should be only one. For example, if n=1 and the path is “/a/”, Microsoft would allow “/a/” and “/A/”; if n=2 and the path is “/ab/”, Microsoft would allow “/ab/”, “/aB/”, “/Ab/” and “/AB/”; and so on. 2^n variations give vast potential for duplicate content, and this is a big issue with sites built on the Microsoft platform. The rel=canonical tag makes it very easy to specify the correct, case-sensitive path on a Microsoft platform:

<link rel="canonical" href="http://www.mysite.com/my/case/sensitive/path/" />
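The 2^n growth described above is easy to check. The following Python sketch enumerates every upper/lower-case variant of a path; the function name case_variants is hypothetical and the sketch assumes plain ASCII paths.

```python
from itertools import product

def case_variants(path):
    """Return every upper/lower-case variant of an ASCII path."""
    # Each letter contributes two choices; every other character contributes one.
    options = [(c.lower(), c.upper()) if c.isalpha() else (c,) for c in path]
    return {"".join(combo) for combo in product(*options)}

print(sorted(case_variants("/ab/")))
print(len(case_variants("/my/case/sensitive/path/")))
```

On a case-insensitive server, every one of those variants serves the same page, and a single rel=canonical tag on that page points them all at the one preferred spelling.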

Static Web Content

Static web content is content that is stored in the format in which it is delivered. Typically, static content is served under a static URL (a URL that does not contain a question mark). However, it is possible to link to static content and append query parameters, even though these query parameters will have no impact on the content that is served. One example of when this might happen is when a referrer parameter is passed to a JavaScript function within the static content:

<a href="http://www.mysite.com/?referrer=myAffiliate0001">Affiliate Link</a>

Thousands of links can be created to a single, static URL, each with a different referrer query parameter attached. For sites built on static content, trying to manage such links has been difficult in the past. Now, it’s relatively easy. Each page of static content simply needs to contain a rel=canonical tag:

<link rel="canonical" href="http://www.mysite.com/my/static/url.html" />

Conclusions: rel=canonical

For the reasons stated above, I would recommend the use of a rel=canonical tag in all static content. In fact, I would recommend its use in all content, static or dynamic – with appropriate care of course. It’s a powerful tag and using it wrongly could have dire consequences.

In the next post I’ll look at some of the limitations of the rel=canonical tag and consider some alternatives.

No tags

I’ve been meaning to write about the new rel=canonical tag, which was proposed by Google, Yahoo and Microsoft on February 12. I managed to squeeze some thoughts on it into my presentation and workshop at SES London, and I’ll be speaking more about it at SES New York next month, but before I blogged about it I really wanted to write more about URL Canonicalisation and Normalisation in general.

Canonicalisation or Canonicalization?
Normalisation or Normalization?

I’m British, so I say Canonicalisation and Normalisation. Your mileage may vary.

What is URL Canonicalisation?

We’re talking about search engines here, so let’s try a definition that applies generally, but leans towards search:

URL Canonicalisation
involves taking a set of different URLs that all serve or lead to the same or similar content, and applying rules to select one URL from that set under which that content should be indexed or presented.

I’ve hyperlinked the terms I think are important to more detail below, but before we go into them let’s try defining URL Normalisation.

URL Normalisation
involves taking a single URL and applying a normalisation algorithm to produce a standard form for that URL.

Others define normalisation and canonicalisation as all part of the same thing, but I like to think of them as separate processes. To my way of thinking:

  • you can normalise a single URL but you can only canonicalise a set of URLs
  • an un-normalised URL will serve the same content as a normalised URL, because it’s the same URL
  • all indexed URLs are normalised; not all are canonicalised
  • normalisation occurs before canonicalisation

Now let’s go back and look at those hyperlinked terms in more detail.

Set of different URLs

This is the key to canonicalisation and why it’s needed: the same content is being presented at a number of different URLs. By different URLs, I mean those URLs are really different to each other – they could potentially show different content but (in this case) they don’t.

Here is an example set of URLs:

  • http://www.example.com/
  • http://example.com/
  • http://www.example.com/index.html
  • http://example.com/default.asp
  • http://www.example.com/?referrer=affiliateName
  • http://www.example.com/?sessionid=123456

All serve or lead to the same or similar content

If each of the above URLs served the same, or essentially the same, content, it’s likely that they would be canonicalised to fewer URLs – possibly only one. If they each served completely different content, then it’s much less likely that this canonicalisation would take place. By “or lead to”, I mean that the URL may redirect (e.g. with a HTTP 301 or HTTP 302 redirect) to another URL.

Canonicalisation Rules

The rules for canonicalisation vary from engine to engine and time to time. Here are a few examples of when canonicalisation will take place …

  • If www and non-www versions of the URL exist, then canonicalise
  • If the same base URL is seen with different numbers of query parameters, then canonicalise
  • If the filename component of the URL matches a known set of index pages (e.g. index.*, default.*, etc.) then canonicalise
  • If the home page (“/”) redirects to another page, then canonicalise

… and here are some examples of how canonicalisation will take place:

  • Choose the URL with the highest Pagerank (or similar link-based or other off-page criteria)
  • Obey the rel=canonical webmaster hint
  • Choose the simplest URL (e.g. the shortest URL, or the one with fewest query parameters)
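As an illustration of the last rule only ("choose the simplest URL"), here is a small Python sketch. It is not how any particular engine works; real engines combine many signals, including the link-based ones mentioned above. The function name simplest_url and the tie-breaking order are assumptions.

```python
from urllib.parse import urlsplit, parse_qsl

def simplest_url(urls):
    """Pick one URL from a set: fewest query parameters first, then shortest."""
    def complexity(url):
        query = urlsplit(url).query
        return (len(parse_qsl(query)), len(url), url)  # final item makes ties deterministic
    return min(urls, key=complexity)

candidates = [
    "http://www.example.com/",
    "http://example.com/",
    "http://www.example.com/index.html",
    "http://example.com/default.asp",
    "http://www.example.com/?referrer=affiliateName",
    "http://www.example.com/?sessionid=123456",
]
print(simplest_url(candidates))
```

Note that this rule alone picks the non-www home page simply because it is shortest, which may not be what the site owner wants. That is exactly why an explicit webmaster hint is useful.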

Indexed or presented

Sometimes only one URL from a set will be indexed, which means that it will always be the candidate URL to be presented in a set of search results.

At other times multiple URLs may be indexed, even though they are known to be part of the same canonical set. One of these URLs will be selected to appear in a given set of search results. The URL that is selected may vary (for example, by query or by searcher location) – but only one will ever appear on a given search results page.

Single URL

Normalisation operates on a single URL rather than on a set of URLs. That single URL may need to be supplemented with other data in order for normalisation to take place. For example, un-normalised URLs may be relative or absolute. A normalised URL will always be a fully-qualified absolute URL so, along with a relative URL, the containing URL or tag will need to be known in order for normalisation to take place.

Normalisation algorithm to produce a standard form

Like canonicalisation rules, the normalisation algorithm may vary from engine to engine and time to time. However, it’s much less likely to vary. Here are examples of the kinds of things that are done during normalisation:

  1. convert a relative URL to an absolute URL
  2. convert the scheme and the host name components of the URL to lower case
  3. remove the port component if it matches the default port
  4. escape characters that should be represented as octets (or a +)
  5. unescape octets that are better represented as plain characters
  6. convert all escape sequences to upper case

Here are some examples of each operation:

  1. In http://www.silverdisc.co.uk/ , a link to “/contact.html” would be normalised to http://www.silverdisc.co.uk/contact.html
  2. HTTP://WWW.SILVERDISC.CO.UK/contact.html would be normalised to http://www.silverdisc.co.uk/contact.html
  3. http://www.silverdisc.co.uk:80/contact.html would be normalised to http://www.silverdisc.co.uk/contact.html, because 80 is the default port for HTTP connections.
  4. http://www.silverdisc.co.uk/contact.html?name=Alan Perkins would be normalised to http://www.silverdisc.co.uk/contact.html?name=Alan+Perkins or http://www.silverdisc.co.uk/contact.html?name=Alan%20Perkins, because a space is not a valid character in a URL.
  5. http://www.silverdisc.co.uk/cont%61ct.html would be normalised to http://www.silverdisc.co.uk/contact.html, because %61 is better represented as the character “a” in a URL.
  6. A %2a in a URL would be converted to %2A for consistency
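The six operations above can be sketched in Python using only the standard library. This is a simplified illustration rather than any engine's actual algorithm: the function name normalise is hypothetical, step 4 only escapes spaces, and any username or password in the URL is ignored.

```python
import re
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "abcdefghijklmnopqrstuvwxyz0123456789-._~")

def _fix_escapes(text):
    def repl(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                       # step 5: unescape plain characters
        return "%" + match.group(1).upper()   # step 6: upper-case escape sequences
    text = re.sub(r"%([0-9A-Fa-f]{2})", repl, text)
    return text.replace(" ", "%20")           # step 4 (simplified to spaces only)

def normalise(url, base=None):
    if base:
        url = urljoin(base, url)              # step 1: relative to absolute
    parts = urlsplit(url)
    scheme = parts.scheme.lower()             # step 2: lower-case scheme and host
    host = (parts.hostname or "").lower()
    port = parts.port
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host                         # step 3: drop the default port
    else:
        netloc = f"{host}:{port}"
    return urlunsplit((scheme, netloc, _fix_escapes(parts.path),
                       _fix_escapes(parts.query), parts.fragment))

print(normalise("/contact.html", base="http://www.silverdisc.co.uk/"))
print(normalise("HTTP://WWW.SILVERDISC.CO.UK:80/cont%61ct.html?name=Alan Perkins"))
```

Every step takes one URL in and produces one URL out, which is what distinguishes normalisation from canonicalisation in the sense used in this post.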

Summary

That completes this introduction to URL canonicalisation and normalisation. In the next post, I’ll look at rel=canonical.

No tags

Matt Cutts has stirred up a little hornets’ nest with his “What should NOINDEX do?” post. Matt reckons the topic will be colossally boring to some people – but not to me. For some reason I find Robots standards fascinating. Yep, I know I’m weird.

The crux of Matt’s issue is …

The question is whether Google should completely drop a NOINDEX’ed page from our search results vs. show a reference to the page, or something in between?

The obvious response is to completely drop the NOINDEX’ed page. NOINDEX is made up of the two words NO and INDEX; so it means do not index, right?

Maybe not. It’s important to be precise here. What exactly does NOINDEX mean?

Often when talking about indexing issues, it’s useful to separate in your mind the indexing of a URL from the indexing of the content at that URL. This concept is particularly important in the contexts of URL canonicalization, duplicate content and … robots standards. I’ll restrict this discussion to the NOINDEX part of the robots standards, but an equally interesting discussion exists around robots.txt too.

Once we separate URL and content, the question “What exactly does NOINDEX mean?” can be answered in several ways:

1) Index the URL but not the content
2) Don’t index the URL or the content
3) (Somehow, not sure how!) index the content but not the URL

One thing is for sure … it does not mean index both the content and the URL. :D

In my opinion NOINDEX should definitely mean “Don’t index the content”. Definitely. No question.

The question of whether it should mean “Don’t index the URL” is an interesting one. There are arguments both ways. In my experience, however, there are many, many different examples of when it should mean “Don’t index the URL”. In these instances, if the URL was indexed, it would result in something bad happening either for searchers, or the site owner, or both. Therefore, generally, I think it should mean “Don’t index the URL”.

However, there is one specific case where I think it would be acceptable to index the URL, and which would give benefit to both searchers and site owners (very often). That specific case is when the URL is the home page of the site.

Taking the three “problem” URLs cited by Matt in his post:

If high-profile sites like

- http://www.police.go.kr/main/index.do (the National Police Agency of Korea)
- http://www.nmc.go.kr/ (the National Medical Center of Korea)
- http://www.yonsei.ac.kr/ (Yonsei University)

aren’t showing up in Google because of the NOINDEX meta tag, that’s bad for users

These three URLs are all actually home pages. The second and third URLs are obviously so. The first URL is the result of a couple of 302 redirects:

  • http://www.police.go.kr/ is a 302 to http://www.police.go.kr/index.jsp
  • http://www.police.go.kr/index.jsp is a 302 to http://www.police.go.kr/main/index.do

This makes http://www.police.go.kr/main/index.do the home page of the site. The way Google works (correctly IMO) is that a redirect from “/” to a deeper page on a site would normally result in the content of that deeper URL being indexed under “/”.

So, I think a reasonable middle ground, that satisfies the best interests of searchers, site owners and search engine, would be the following:

  1. Do not index the content.
  2. Do not link to the URL in the search results, unless the URL is a “home page” (/, or redirected to by /).
  3. If it is a home page with a NOINDEX tag, it’s OK to link to it in the SERPs, but do not index the content; do not provide a snippet; and do not provide a cached copy. Treat it like a “partially indexed page”.
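The three rules above amount to a small decision procedure, sketched here in Python. The function names and the shape of the redirects mapping are hypothetical; the sketch simply treats a URL as a home page if it is “/” or is reached by following redirects from “/”.

```python
from urllib.parse import urlsplit, urlunsplit

def is_home_page(url, redirects):
    """True if url is "/" on its host, or is reached by following redirects from "/"."""
    parts = urlsplit(url)
    current = urlunsplit((parts.scheme, parts.netloc, "/", "", ""))
    seen = set()
    while current not in seen:  # guard against redirect loops
        if current == url:
            return True
        seen.add(current)
        if current not in redirects:
            return False
        current = redirects[current]
    return False

def noindex_treatment(url, redirects):
    """Proposed handling of a URL whose page carries a NOINDEX meta tag."""
    return {
        "index_content": False,                         # rule 1
        "link_in_serps": is_home_page(url, redirects),  # rules 2 and 3
        "snippet": False,                               # rule 3
        "cached_copy": False,                           # rule 3
    }

redirects = {
    "http://www.police.go.kr/": "http://www.police.go.kr/index.jsp",
    "http://www.police.go.kr/index.jsp": "http://www.police.go.kr/main/index.do",
}
print(noindex_treatment("http://www.police.go.kr/main/index.do", redirects))
```

With the redirect chain described above, the police.go.kr URL counts as a home page, so it would be linked in the results without a snippet or cached copy.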

No tags

Poor old Ming Campbell. Literally, “old” Ming Campbell has resigned/been ousted from the leadership of the Liberal Democrats, the UK’s third political party, because he is too old at 66. Supposedly, in this News 24 society, you need to be young, dynamic and good looking in order to attract votes and one of his likely successors is said to be telegenic enough to fit the bill.

Telegenic. What a horrible word, a real cut and shut job (photogenic and television, but the result should mean “produced at a distance”, not “looks good on television”). But it got me thinking … the phrase “search engine friendly” has always seemed so clumsy. So what about “robogenic” as a one-word equivalent, meaning “search engine friendly”, or “looks good to a robot”.

robogenic
search engine friendly; looks good to a robot

I like it. Unfortunately, robogenic should literally mean “Produced by a robot”, in the same way as photogenic literally means “Produced by light”. Ah, so what? I still like it. :D

No tags

This article was first written and published by me in 2000/2001, but no longer exists on the Web. It’s still accurate – although search engines (notably Google) have taken steps to correct some of the problems described below, they can and do still arise.

There are two common protocols for the prevention of indexing of Web resources:

  1. The robots.txt protocol
  2. The robots meta tag protocol

This article describes:

  • The theory and practice of these two protocols
  • Anomalies and inadequacies in the protocols

The robots.txt protocol

A search engine spider is a Web robot and, as such, may choose to obey the robots.txt protocol. The robots.txt protocol was invented in 1994 and has remained as the de facto standard for controlling robots’ access to a Web site. Most search engines claim to support it, but no robot, including a search engine spider, has to support it.

The protocol is described in the document “A Standard for Robot Exclusion”. That is the page that most search engines that support the robots.txt protocol will refer you to if you require more details. However, if you read that page, you will see that it contains no reference to search engines at all. The introduction to the page says:

In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren’t welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren’t suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.

So, the purpose of the robots.txt protocol is to provide a mechanism for WWW servers to indicate to robots which parts of their server should not be accessed, i.e. to prevent robots from reading parts of their server. How does this purpose relate to preventing a search engine from indexing a particular resource? Unfortunately, the general answer to this question is “It doesn’t”.

The Disallow line in a robots.txt file means “disallow reading”, but that does NOT mean “disallow indexing”. In other words a disallowed resource may be listed in a search engine’s index, even if the search engine obeys the protocol. The most obvious demonstration of this is Google. Google can add files to its index without reading them, merely by considering links to those files. In theory, Google can build an index of an entire Web site without ever visiting that site or ever retrieving its robots.txt file. In so doing it is not breaking the robots.txt protocol because it is not reading any disallowed resources, it is simply reading other web sites’ links to those resources.

The Disallow line in a robots.txt file means “Disallow reading”, it does not mean “Disallow indexing”. A resource does not necessarily need to be read in order to be indexed.

Let’s return to the question of how the robots.txt file can be used to prevent a search engine from listing a particular resource in its index. In practice most search engines have placed their own interpretation on the robots.txt file which allows it to be used to prevent them adding resources to their index, as follows. Most search engines interpret a resource being disallowed by the robots.txt file as meaning they should not add it to their index, and if it is already in their index (placed there by previous spidering activity) they remove it. This last point is important, and an example will illustrate the point.

A particular resource may have been published to a particular Web site on 1st January 2000. That resource may have been indexed by a search engine on 1st February 2000. On 1st March 2000, the site owner may have modified the site’s robots.txt file to disallow the resource from being read by the search engine spider. On 1st April 2000, the search engine spider may re-visit the Web site and note the new entry in the robots.txt file. The search engine spider may now simply choose not to read the resource but to leave the copy of the resource in its index unchanged, and this would not be breaking the robots.txt protocol. But most search engine spiders will both:

  1. not read the resource and
  2. remove the resource from their index.

In this example, note that throughout March the resource was in the search engine’s index even though it was disallowed by the robots.txt file.

In practice, most search engines interpret a Disallow line as meaning “Do not index this resource and, if you already have an index of this resource, remove it”. It may take some time from the point a resource is Disallowed to the point that resource is removed from a particular search engine’s index. If you want to ensure a particular resource is never indexed, ensure it is prevented from being indexed by a Disallow line in the robots.txt file before publishing the resource for the first time.

Now let’s consider how the robots.txt protocol can be used to prevent binary resources, such as images (e.g. GIF files), from being added to a search engine’s index. Let’s suppose a particular Web site put all its images in a directory called /images, and had the following robots.txt file:


User-agent: *
Disallow: /images/

You might think that this would prevent the site’s images being indexed by image search engines. But think again about what we have learned about the robots.txt file. It prevents Web robots, including search engine spiders, from reading a resource. But search engines do not need to read an image before adding it to their index. Many spiders just read the ALT text of the IMG tags that refer to the image, rather than reading the image itself. Since the spiders are not reading the image, they are not in breach of the robots.txt protocol if they index the image. This scenario is analogous to Google building an index of a resource without reading that resource: an image search engine can build an index of an image without reading an image.
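The distinction between reading and indexing can be demonstrated with Python's standard library. In this sketch, which uses a made-up page and the hypothetical robot name ExampleBot, the robots.txt rules forbid reading the image, yet an index entry for the image is still built purely from the ALT text on a page that may be read.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

class AltTextCollector(HTMLParser):
    """Collect image URLs and their ALT text from a page, without fetching the images."""
    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.images = {}

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            if "src" in attributes:
                image_url = urljoin(self.page_url, attributes["src"])
                self.images[image_url] = attributes.get("alt") or ""

rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /images/"])

page_url = "http://www.example.com/index.html"
collector = AltTextCollector(page_url)
collector.feed('<p><img src="/images/logo.gif" alt="Company logo"></p>')

for image_url, alt_text in collector.images.items():
    # The image is never requested, so the Disallow line is never breached.
    may_read = rules.can_fetch("ExampleBot", image_url)
    print(image_url, "| may be read:", may_read, "| indexed as:", alt_text)
```

The robot obeys the protocol throughout: it only ever reads index.html, which is allowed.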

Once again, in practice most image search engines interpret a Disallow line referring to an image as meaning “Do not index this image and, if you already have an index of this image, remove it”. It may take some time from the point an image is Disallowed to the point that image is removed from a particular image search engine’s index.

Finally, a question that exposes the worst flaw of the robots.txt protocol: a webmaster wishes to make all pages of a Web site, EXCEPT the home page (i.e. “/”), accessible to robots; how can she do this using the robots.txt protocol? The answer – “She can’t”.
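The reason is that the original protocol only offers Disallow, and a Disallow value is matched as a plain prefix of the URL path. The sketch below is an assumed, simplified model of that rule (the function name is_disallowed and the sample paths are hypothetical, and it ignores the extensions that some search engines have added since). Every path on a site starts with “/”, so the only value that matches the home page matches everything else as well.

```python
def is_disallowed(path, disallow_values):
    """Simplified model of the original rule: a path is blocked if it
    starts with any non-empty Disallow value."""
    return any(value and path.startswith(value) for value in disallow_values)

# Blocking "/" to hide the home page also blocks every other page.
print(is_disallowed("/", ["/"]))
print(is_disallowed("/products/widget.htm", ["/"]))

# Any longer prefix leaves the home page readable.
print(is_disallowed("/", ["/images/"]))
```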

The robots meta tag protocol

The robots meta tag protocol was invented after the robots.txt protocol. It was originally designed to give HTML developers who did not have permission to write the robots.txt file to the root of a server some control over the indexing of Web pages. Unlike the robots.txt protocol, the robots meta tag protocol:

  1. specifically states whether a resource may or may not be indexed
  2. can help, but cannot prevent, a particular resource from being read
  3. does not allow large-scale (wildcard) prevention of indexing
  4. cannot be used to prevent anything except HTML files from being indexed, since the meta tag can only be placed in HTML files (if following the strict definition of the protocol)

Note in particular point 2: the robots meta tag protocol cannot prevent a particular resource from being read because a resource must be read in order to obtain the tag it contains. You may think that if every document that linked to a particular resource contained a robots meta tag NOFOLLOW attribute, that resource could never be read. But what if a new document is added anywhere on the Web, and that document links to the resource? Or what if somebody submits the resource directly to the Add URL page of a search engine? In both these cases, a search engine will read the resource before discovering the robots meta tag. So the problems the robots.txt protocol was designed to fix (e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects such as voting) are not addressed by the robots meta tag protocol. In other words, there is no “NOREAD” attribute!

So, we’ve said what the robots meta tag is not, but what is it? The robots meta tag is included in an HTML file and defines separately whether the file may be indexed (using the INDEX attribute) or spidered (using the FOLLOW attribute). However, the robots meta tag enjoys less support than the robots.txt file. It is unclear how much of the standard each search engine supports. Would every search engine, for example, correctly interpret a “noindex, follow” set of attributes?
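As a rough illustration of what “correctly interpret” would involve, the sketch below reads a robots meta tag with Python's standard html.parser module and reports the two permissions separately. The class and function names are hypothetical, and the sketch assumes that a page with no tag defaults to “index, follow” and that only the NOINDEX and NOFOLLOW values matter; a real spider may well behave differently.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the comma-separated values of any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)

def robots_permissions(html):
    """Return separate index/follow permissions (assumed default: both allowed)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        "index": "noindex" not in parser.directives,
        "follow": "nofollow" not in parser.directives,
    }

page = '<html><head><META NAME="robots" CONTENT="noindex, follow"></head></html>'
print(robots_permissions(page))
```

Note that the function can only run once the page's HTML is in hand, which is point 2 above in action: the tag cannot stop the page from being read.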

Since the robots meta tag can only be used within an HTML file, and the NOINDEX attribute only refers to the file that contains it, it cannot be used to prevent binary resources (such as images) from being indexed. Some search engines have invented extensions to the protocol to overcome this problem, but the extensions are not part of the protocol. For example, AltaVista has invented its own robots meta tag attribute (NOIMAGEINDEX) to prevent images from being indexed.

The behaviour of these extension tags is not well defined. An example will illustrate the main problem:

  1. a particular Web site, let’s call it www.example-one.com, consists of 10 pages
  2. each of the 10 pages includes an image at www.example-one.com/images/example.gif
  3. nine of the ten pages contain a robots meta tag like this: <META NAME="robots" CONTENT="index,follow,noimageindex">
  4. however, www.example-one.com’s home page contains the following robots meta tag: <META NAME="robots" CONTENT="index,follow">

The “noimageindex” attribute is only understood by AltaVista’s image spider. So, when AltaVista’s image spider reads the site, will it add example.gif to AltaVista’s image index? The answer to this question is undefined: nine out of ten pages say it’s not OK to index the image, but one out of ten pages says (implicitly) that it is OK. So the image spider might, or might not, index the image. It all depends on the order in which the spider reads the pages, the number of pages read by the spider (it might only read the home page), and a multitude of other factors.

To make matters worse, now suppose that there is another Web site called www.example-two.com, every page of which also includes www.example-one.com/images/example.gif. None of the pages on www.example-two.com include a robots meta tag. Would an image spider add example.gif to its index now? Again, the answer to this question is undefined.
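A small sketch shows why “undefined” matters in practice. It models the ten pages of www.example-one.com and two policies that an image spider could plausibly adopt. Both policies, the function names and the page file names are invented for the illustration; nothing here describes how AltaVista's spider actually behaves.

```python
# Hypothetical page names; only the home page comes from the example above.
HOME = "www.example-one.com/"
OTHER_PAGES = [f"www.example-one.com/page{i}.htm" for i in range(1, 10)]

# The robots meta tag values found on each page.
DIRECTIVES = {HOME: {"index", "follow"}}
for page in OTHER_PAGES:
    DIRECTIVES[page] = {"index", "follow", "noimageindex"}

def first_page_wins(crawl_order):
    """Policy A (assumed): the first page read decides whether the image is indexed."""
    return "noimageindex" not in DIRECTIVES[crawl_order[0]]

def any_page_can_veto(crawl_order):
    """Policy B (assumed): a noimageindex on any page read keeps the image out."""
    return all("noimageindex" not in DIRECTIVES[page] for page in crawl_order)

home_first = [HOME] + OTHER_PAGES
home_last = OTHER_PAGES + [HOME]
home_only = [HOME]

for order in (home_first, home_last, home_only):
    print(first_page_wins(order), any_page_can_veto(order))
```

The two policies disagree for one crawl order and the answer changes again when the order or the number of pages read changes, so the site owner cannot predict the result.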

Now a question to test the theory so far … A site owner attempts to exclude a page from being indexed by search engines by adding both a Disallow line to the site’s robots.txt file and a robots meta tag with a noindex attribute to the page itself, before publishing the resource for the first time. Is there any way that a search engine that obeys the robots.txt protocol and the robots meta tag meticulously can have a reference to the resource in its index?

Let’s work this through.

  1. Suppose the resource is called noindex.htm and it contains the following robots meta tag: <META NAME="robots" CONTENT="noindex,nofollow">
  2. The URL http://www.example-three.com/robots.txt is then created as follows:
    User-agent: *
    Disallow: /noindex.htm
  3. noindex.htm is then published to www.example-three.com/noindex.htm for the first time.

Surely noindex.htm can’t possibly be indexed by a search engine that obeys the robots.txt protocol and the robots meta tag protocol? Can it? It can. In fact, only a search engine that completely obeys both standards can index it. Here’s how.

Our very obedient search engine works a little like Google. So, while its spider is spidering the Web, it finds references to noindex.htm. Each time it finds a reference, the spider creates a better picture of noindex.htm in its index, without ever reading noindex.htm. Sooner or later, the spider visits www.example-three.com. The first thing it does is read robots.txt to find pages it is not allowed to read. The only page it is not allowed to read is noindex.htm, so it doesn’t read that page. It doesn’t remove the page from its index, because, strictly speaking, that is not what the robots.txt protocol means. Because the spider cannot read noindex.htm, it cannot find the robots tag on that page preventing it from indexing that page. Therefore, the page remains in the search engine’s index.
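The walk-through above can be sketched as a toy spider. The robots.txt rules, the URL and the meta tag are taken from the example; everything else (the function name crawl, the user agent ObedientBot, the link text, and the crude substring test standing in for real meta tag parsing) is an assumption made for the illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = ["User-agent: *", "Disallow: /noindex.htm"]
PAGE_URL = "http://www.example-three.com/noindex.htm"
PAGE_HTML = '<META NAME="robots" CONTENT="noindex,nofollow">'

def crawl(inbound_link_texts):
    """Return the toy index after an obedient spider has visited the site."""
    index = {}

    # 1. While spidering other sites, the spider records what their links
    #    say about the page, without reading the page itself.
    for text in inbound_link_texts:
        index.setdefault(PAGE_URL, []).append(text)

    # 2. On reaching the site, the spider reads robots.txt first.
    rules = RobotFileParser()
    rules.parse(ROBOTS_TXT)

    # 3. Only a spider that is allowed to read the page can see its meta tag.
    if rules.can_fetch("ObedientBot", PAGE_URL):
        if "noindex" in PAGE_HTML.lower():  # crude stand-in for tag parsing
            index.pop(PAGE_URL, None)

    return index

print(crawl(["a page about widgets"]))
```

Because step 3 never runs for the disallowed page, the entry built in step 1 survives. Delete the Disallow line from ROBOTS_TXT and the spider reads the page, finds the noindex value and drops the entry.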

Future posts will cover the new features in robots.txt, the robots meta tag and Webmaster Tools that address some of the above problems.
