Archive for the ‘SEO’ Category

When should “NOINDEX” mean “INDEX”?

Friday, February 29th, 2008

Matt Cutts has stirred up a little hornets’ nest with his “What should NOINDEX do?” post. Matt reckons the topic will be colossally boring to some people - but not to me. For some reason I find Robots standards fascinating. Yep, I know I’m weird.

The crux of Matt’s issue is …

The question is whether Google should completely drop a NOINDEX’ed page from our search results vs. show a reference to the page, or something in between?

The obvious response is to completely drop the NOINDEX’ed page. NOINDEX is made up of the two words NO and INDEX; so it means do not index, right?

Maybe not. It’s important to be precise here. What exactly does NOINDEX mean?

Often when talking about indexing issues, it’s useful to separate in your mind the indexing of a URL from the indexing of the content at that URL. This concept is particularly important in the contexts of URL canonicalization, duplicate content and … robots standards. I’ll restrict this discussion to the NOINDEX part of the robots standards, but an equally interesting discussion exists around robots.txt too.

Once we separate URL and content, the question “What exactly does NOINDEX mean?” can be answered in several ways:

1) Index the URL but not the content
2) Don’t index the URL or the content
3) (Somehow, not sure how!) index the content but not the URL

One thing is for sure … it does not mean index both the content and the URL. :D

In my opinion NOINDEX should definitely mean “Don’t index the content”. Definitely. No question.

The question of whether it should mean “Don’t index the URL” is an interesting one. There are arguments both ways. In my experience, however, there are many, many different examples of when it should mean “Don’t index the URL”. In these instances, if the URL was indexed, it would result in something bad happening either for searchers, or the site owner, or both. Therefore, generally, I think it should mean “Don’t index the URL”.

However, there is one specific case where I think it would be acceptable to index the URL, and which would give benefit to both searchers and site owners (very often). That specific case is when the URL is the home page of the site.

Taking the three “problem” URLs cited by Matt in his post:

If high-profile sites like

- http://www.police.go.kr/main/index.do (the National Police Agency of Korea)
- http://www.nmc.go.kr/ (the National Medical Center of Korea)
- http://www.yonsei.ac.kr/ (Yonsei University)

aren’t showing up in Google because of the NOINDEX meta tag, that’s bad for users

These three URLs are all actually home pages. The second and third URLs are obviously so. The first URL is the result of a couple of 302 redirects:

  • http://www.police.go.kr/ is a 302 to http://www.police.go.kr/index.jsp
  • http://www.police.go.kr/index.jsp is a 302 to http://www.police.go.kr/main/index.do

This makes http://www.police.go.kr/main/index.do the home page of the site. The way Google works (correctly IMO) is that a redirect from “/” to a deeper page on a site would normally result in the content of that deeper URL being indexed under “/”.

So, I think a reasonable middle ground, that satisfies the best interests of searchers, site owners and search engine, would be the following:

  1. Do not index the content.
  2. Do not link to the URL in the search results, unless the URL is a “home page” (/, or redirected to by /).
  3. If it is a home page with a NOINDEX tag, it’s OK to link to it in the SERPs, but do not index the content; do not provide a snippet; and do not provide a cached copy. Treat it like a “partially indexed page”.
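To make the three rules concrete, here is a minimal sketch of how they might be expressed as a decision function. The names (SerpTreatment, is_home_page, treatment) are my own inventions for illustration, and the home page test is an assumption: a URL counts as a home page if its path is "/" or if it is the known final target of the redirects from "/". This is not how Google, or any other engine, actually implements it.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class SerpTreatment:
    index_content: bool  # may the page text be indexed and searched?
    show_url: bool       # may a bare link to the URL appear in results?
    show_snippet: bool
    show_cache: bool


def is_home_page(url: str, root_redirect_target: Optional[str] = None) -> bool:
    """True if url is "/" on its host, or is where "/" redirects to."""
    if root_redirect_target is not None and url == root_redirect_target:
        return True
    parts = urlsplit(url)
    return parts.path in ("", "/") and not parts.query


def treatment(url: str, has_noindex: bool,
              root_redirect_target: Optional[str] = None) -> SerpTreatment:
    if not has_noindex:
        return SerpTreatment(True, True, True, True)
    # Rules 1 to 3: never index the content, and only link to the URL
    # when it is a home page. No snippet and no cached copy either way.
    return SerpTreatment(False, is_home_page(url, root_redirect_target),
                         False, False)
```

Under this sketch, treatment("http://www.nmc.go.kr/", True) allows a bare link in the results, while a NOINDEX'ed deep page gets no listing at all.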

Robogenic - the one word expression for “search engine friendly”

Friday, October 19th, 2007

Poor old Ming Campbell. Literally, “old” Ming Campbell has resigned/been ousted from the leadership of the Liberal Democrats, the UK’s third political party, because he is too old at 66. Supposedly, in this News 24 society, you need to be young, dynamic and good looking in order to attract votes and one of his likely successors is said to be telegenic enough to fit the bill.

Telegenic. What a horrible word, a real cut and shut job (photogenic and television, but the result should mean “produced at a distance”, not “looks good on television”). But it got me thinking … the phrase “search engine friendly” has always seemed so clumsy. So what about “robogenic” as a one-word equivalent, meaning “search engine friendly”, or “looks good to a robot”.

robogenic
search engine friendly; looks good to a robot

I like it. Unfortunately, robogenic should literally mean “Produced by a robot”, in the same way as photogenic literally means “Produced by light”. Ah, so what? I still like it. :D

The robots.txt file and the robots meta tag

Tuesday, September 18th, 2007

This article was first written and published by me in 2000/2001, but no longer exists on the Web. It’s still accurate - although search engines (notably Google) have taken steps to correct some of the problems described below, they can and do still arise.

There are two common protocols for the prevention of indexing of Web resources:

  1. The robots.txt protocol
  2. The robots meta tag protocol

This article describes:

  • The theory and practice of these two protocols
  • Anomalies and inadequacies in the protocols

The robots.txt protocol

A search engine spider is a Web robot and, as such, may choose to obey the robots.txt protocol. The robots.txt protocol was invented in 1994 and has remained as the de facto standard for controlling robots’ access to a Web site. Most search engines claim to support it, but no robot, including a search engine spider, has to support it.

The protocol is described in the document “A Standard for Robot Exclusion”. That is the page that most search engines that support the robots.txt protocol will refer you to if you require more details. However, if you read that page, you will see that it contains no reference to search engines at all. The introduction to the page says:

In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren’t welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren’t suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.

So, the purpose of the robots.txt protocol is to provide a mechanism for WWW servers to indicate to robots which parts of their server should not be accessed, i.e. to prevent robots from reading parts of their server. How does this purpose relate to preventing a search engine from indexing a particular resource? Unfortunately, the general answer to this question is “It doesn’t”.

The Disallow line in a robots.txt file means “disallow reading”, but that does NOT mean “disallow indexing”. In other words a disallowed resource may be listed in a search engine’s index, even if the search engine obeys the protocol. The most obvious demonstration of this is Google. Google can add files to its index without reading them, merely by considering links to those files. In theory, Google can build an index of an entire Web site without ever visiting that site or ever retrieving its robots.txt file. In so doing it is not breaking the robots.txt protocol because it is not reading any disallowed resources, it is simply reading other web sites’ links to those resources.

The Disallow line in a robots.txt file means “Disallow reading”, it does not mean “Disallow indexing”. A resource does not necessarily need to be read in order to be indexed.
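The distinction can be seen with Python's standard urllib.robotparser module. This is only a sketch: the robots.txt content, the ExampleBot name, the example.com URLs and the link_index dictionary are all hypothetical. The parser answers one question, "may I read this URL?", and says nothing about whether the URL may be listed.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_read(url, agent="ExampleBot"):
    """The only question robots.txt answers: may this robot fetch the URL?"""
    return parser.can_fetch(agent, url)


# A link-only index: URL -> anchor texts seen on OTHER pages.
# Building it never requires fetching the disallowed URL itself.
link_index = {}


def record_link(target_url, anchor_text):
    link_index.setdefault(target_url, []).append(anchor_text)


record_link("http://www.example.com/private/report.html", "annual report")
```

The disallowed URL ends up in link_index even though may_read() returns False for it, which is exactly the gap described above.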

Let’s return to the question of how the robots.txt file can be used to prevent a search engine from listing a particular resource in its index. In practice most search engines have placed their own interpretation on the robots.txt file which allows it to be used to prevent them adding resources to their index, as follows. Most search engines interpret a resource being disallowed by the robots.txt file as meaning they should not add it to their index, and if it is already in their index (placed there by previous spidering activity) they remove it. This last point is important, and an example will illustrate the point.

A particular resource may have been published to a particular Web site on 1st January 2000. That resource may have been indexed by a search engine on 1st February 2000. On 1st March 2000, the site owner may have modified the site’s robots.txt file to disallow the resource from being read by the search engine spider. On 1st April 2000, the search engine spider may re-visit the Web site and note the new entry in the robots.txt file. The search engine spider may now simply choose not to read the resource but to leave the copy of the resource in its index unchanged, and this would not be breaking the robots.txt protocol. But most search engine spiders will both:

  1. not read the resource and
  2. remove the resource from their index.

In this example, note that throughout March the resource was in the search engine’s index even though it was disallowed by the robots.txt file.

In practice, most search engines interpret a Disallow line as meaning “Do not index this resource and, if you already have an index of this resource, remove it”. It may take some time from the point a resource is Disallowed to the point that resource is removed from a particular search engine’s index. If you want to ensure a particular resource is never indexed, ensure it is prevented from being indexed by a Disallow line in the robots.txt file before publishing the resource for the first time.
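The difference between the strict reading and the common interpretation can be modelled in a few lines. This is a toy model only: the Policy names and the recrawl function are assumptions of mine, and a real index is obviously not a Python set.

```python
from enum import Enum


class Policy(Enum):
    STRICT = "strict"  # Disallow only means "do not read"
    COMMON = "common"  # Disallow also means "remove from the index"


def recrawl(index, url, disallowed, policy):
    """Return the index as it stands after the spider re-visits the site."""
    if disallowed and policy is Policy.COMMON:
        return index - {url}
    # Strict reading: the spider skips the page but keeps its old entry.
    return set(index)
```

With the resource already indexed in February and disallowed in March, the April re-visit leaves the entry in place under Policy.STRICT and removes it under Policy.COMMON.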

Now let’s consider how the robots.txt protocol can be used to prevent binary resources, such as images (e.g. GIF files), from being added to a search engine’s index. Let’s suppose a particular Web site put all its images in a directory called /images, and had the following robots.txt file:


User-agent: *
Disallow: /images/

You might think that this would prevent the site’s images being indexed by image search engines. But think again about what we have learned about the robots.txt file. It prevents Web robots, including search engine spiders, from reading a resource. But search engines do not need to read an image before adding it to their index. Many spiders just read the ALT text of the IMG tags that refer to the image, rather than reading the image itself. Since the spiders are not reading the image, they are not in breach of the robots.txt protocol if they index the image. This scenario is analogous to Google building an index of a resource without reading that resource: an image search engine can build an index of an image without reading an image.
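Here is a rough sketch of that scenario, using only the Python standard library. The page HTML, the ExampleImageBot name and the example.com URLs are made up for illustration. The spider reads an allowed HTML page, picks up the ALT text, and builds an image index entry without ever requesting the image file.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


class ImgAltCollector(HTMLParser):
    """Collects (image URL, ALT text) pairs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            if a.get("src"):
                self.images.append(
                    (urljoin(self.base_url, a["src"]), a.get("alt") or ""))


robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /images/"])

page_url = "http://www.example.com/index.html"
page_html = '<p><img src="/images/logo.gif" alt="Example Ltd logo"></p>'

image_index = {}
if robots.can_fetch("ExampleImageBot", page_url):  # the page is allowed
    collector = ImgAltCollector(page_url)
    collector.feed(page_html)
    for img_url, alt in collector.images:
        image_index[img_url] = alt  # indexed without reading the image
```

The image URL itself fails the can_fetch() test, yet it is in image_index, and the spider has not breached the protocol.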

Once again, in practice most image search engines interpret a Disallow line referring to an image as meaning “Do not index this image and, if you already have an index of this image, remove it”. It may take some time from the point an image is Disallowed to the point that image is removed from a particular image search engine’s index.

Finally, a question that exposes the worst flaw of the robots.txt protocol: a webmaster wishes to make all pages of a Web site, EXCEPT the home page (i.e. “/”), accessible to robots; how can she do this using the robots.txt protocol? The answer - “She can’t”.
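The reason is that the original protocol matches by prefix: a Disallow value blocks every path that starts with it. The following sketch of that matching rule (the function name is mine, and it deliberately ignores any later extensions that individual search engines may support) shows why the only rule that names the home page also names everything else.

```python
def disallowed(path, disallow_rules):
    """Original 1994-style matching: a rule blocks every path that
    starts with it. An empty rule blocks nothing."""
    return any(rule and path.startswith(rule) for rule in disallow_rules)


home_page_rule = ["/"]  # the only way to refer to the home page

blocks_home = disallowed("/", home_page_rule)
blocks_about = disallowed("/about.html", home_page_rule)
```

Both blocks_home and blocks_about come out True: every path on the site starts with "/", so she cannot single out the home page.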

The robots meta tag protocol

The robots meta tag protocol was invented after the robots.txt protocol. It was originally designed to allow HTML developers that did not have permission to write the robots.txt file to the root of a server to have control over the indexing of Web pages. Unlike the robots.txt protocol, the robots meta tag protocol:

  1. specifically states whether a resource may or may not be indexed
  2. can help, but cannot prevent, a particular resource from being read
  3. does not allow large-scale (wildcard) prevention of indexing
  4. cannot be used to prevent anything except HTML files from being indexed, since the meta tag can only be placed in HTML files (if following the strict definition of the protocol)

Note in particular point 2: the robots meta tag protocol cannot prevent a particular resource from being read because a resource must be read in order to obtain the tag it contains. You may think that if every document that linked to a particular resource contained a robots meta tag NOFOLLOW attribute, that resource could never be read - but what if a new document is added anywhere on the Web, and that document links to the resource? Or what if somebody submits the resource directly to the Add URL page of a search engine? In both these cases, a search engine will read the resource before discovering the robots meta tag. So the problems the robots.txt protocol was designed to fix - e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting) - are not addressed by the robots meta tag protocol. In other words, there is no “NOREAD” attribute!

So, we’ve said what the robots meta tag is not, but what is it? The robots meta tag is included in an HTML file and defines separately whether the file may be indexed (using the INDEX attribute) or spidered (using the FOLLOW attribute). However, the robots meta tag enjoys less support than the robots.txt file. It is unclear how much of the standard search engines support. Would every search engine, for example, correctly interpret a “noindex, follow” set of attributes?
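As a sketch of what “correctly interpret” might involve, here is one way a spider could read the tag using Python's standard html.parser. The class and function names are hypothetical, and the defaults (index and follow both allowed when no tag is present) are my assumption about sensible behaviour. Note that the page has to be read before any of this can happen.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Reads the robots meta tag, treating INDEX and FOLLOW separately."""

    def __init__(self):
        super().__init__()
        self.index = True   # assumed default when no tag is present
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)
        if (a.get("name") or "").lower() != "robots":
            return
        for token in (a.get("content") or "").lower().split(","):
            token = token.strip()
            if token == "noindex":
                self.index = False
            elif token == "nofollow":
                self.follow = False


def robots_directives(html):
    """Return (may_index, may_follow) for one HTML document."""
    p = RobotsMetaParser()
    p.feed(html)
    return p.index, p.follow
```

For a page carrying "noindex, follow" this returns (False, True): do not index the page, but do follow its links.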

Since the robots meta tag can only be used within an HTML file, and the NOINDEX attribute only refers to the file that contains it, it cannot be used to prevent binary resources (such as images) from being indexed. Some search engines have invented extensions to the protocol to overcome this problem, but the extensions are not part of the protocol. For example, AltaVista has invented its own robots meta tag attribute (NOIMAGEINDEX) to prevent images from being indexed.

The behaviour of these extension tags is not well defined. An example will illustrate the main problem:

  1. a particular Web site, let’s call it www.example-one.com, consists of 10 pages
  2. each of the 10 pages includes an image at www.example-one.com/images/example.gif
  3. nine of the ten pages contain a robots meta tag like this: <META NAME="robots" CONTENT="index,follow,noimageindex">
  4. however, www.example-one.com’s home page contains the following robots meta tag: <META NAME="robots" CONTENT="index,follow">

The “noimageindex” attribute is only understood by AltaVista’s image spider. So, when AltaVista’s image spider reads the site, will it add example.gif to AltaVista’s image index? The answer to this question is undefined – nine out of ten pages say it’s not OK to index the image, but one out of ten pages says (implicitly) that it is OK. So the image spider might, or might not, index the image. It all depends on the order the spider reads the pages, the number of pages read by the spider (it might only read the home page), and a multitude of other factors.

To make matters worse, now suppose that there is another Web site called www.example-two.com, every page of which also includes www.example-one.com/images/example.gif. None of the pages on www.example-two.com include a robots meta tag. Would an image spider add example.gif to its index now? Again, the answer to this question is undefined.
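To show why “undefined” matters, here are two conflict-resolution rules an image spider could plausibly adopt. Both are pure guesses on my part (the function names and rules are hypothetical, and I have no idea what AltaVista actually does), but they give different answers for the same ten pages.

```python
def indexed_if_no_page_forbids(pages):
    """Index the image only if no referring page says noimageindex."""
    return not any("noimageindex" in directives for directives in pages)


def indexed_if_first_page_allows(pages):
    """Let the first page crawled decide."""
    return "noimageindex" not in pages[0]


home = {"index", "follow"}
other = {"index", "follow", "noimageindex"}

home_first = [home] + [other] * 9  # spider starts at the home page
home_last = [other] * 9 + [home]   # spider reaches the home page last
```

Under the first rule the image is never indexed. Under the second it is indexed when the crawl starts at the home page and not otherwise, so the outcome depends on crawl order.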

Now a question to test the theory so far … A site owner attempts to exclude a page from being indexed by search engines by adding both a Disallow line to the site’s robots.txt file and a robots meta tag with a noindex attribute to the page itself, before publishing the resource for the first time. Is there any way that a search engine that meticulously obeys both the robots.txt protocol and the robots meta tag protocol can have a reference to the resource in its index?

Let’s work this through.

  1. Suppose the resource is called noindex.htm and it contains the following robots meta tag: <META NAME="robots" CONTENT="noindex,nofollow">
  2. The URL http://www.example-three.com/robots.txt is then created as follows:
    User-agent: *
    Disallow: /noindex.htm
  3. noindex.htm is then published to www.example-three.com/noindex.htm for the first time.

Surely noindex.htm can’t possibly be indexed by a search engine that obeys the robots.txt protocol and the robots meta tag protocol? Can it? It can. In fact, only a search engine that completely obeys both standards can index it. Here’s how.

Our very obedient search engine works a little like Google. So, while its spider is spidering the Web, it finds references to noindex.htm. Each time it finds a reference, the spider creates a better picture of noindex.htm in its index, without ever reading noindex.htm. Sooner or later, the spider visits www.example-three.com. The first thing it does is read robots.txt to find pages it is not allowed to read. The only page it is not allowed to read is noindex.htm, so it doesn’t read that page. It doesn’t remove the page from its index, because, strictly speaking, that is not what the robots.txt protocol means. Because the spider cannot read noindex.htm, it cannot find the robots tag on that page preventing it from indexing that page. Therefore, the page remains in the search engine’s index.
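The whole scenario can be simulated in a few lines of Python. This is a toy model under stated assumptions: the two linking sites (example-four.com and example-five.com), the ObedientBot name and the dictionary used as an "index" are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

TARGET = "http://www.example-three.com/noindex.htm"

# Hypothetical pages on other sites: (URL, [(link target, anchor text)]).
OTHER_PAGES = [
    ("http://www.example-four.com/links.htm", [(TARGET, "secret page")]),
    ("http://www.example-five.com/blog.htm", [(TARGET, "do not index me")]),
]

# The robots.txt published on www.example-three.com.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /noindex.htm"])

index = {}       # URL -> anchor texts that describe it
pages_read = []  # URLs whose content the spider actually fetched

# Spidering the rest of the Web builds a picture of the target from links.
for page_url, links in OTHER_PAGES:
    pages_read.append(page_url)
    for target, anchor in links:
        index.setdefault(target, []).append(anchor)

# Visiting www.example-three.com: obey robots.txt before reading anything.
if robots.can_fetch("ObedientBot", TARGET):
    pages_read.append(TARGET)  # only here could the NOINDEX tag be seen
```

At the end, TARGET is in index but not in pages_read: the spider obeyed both protocols to the letter and still holds a reference to the page.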

Future posts will address the new features in robots.txt, the robots meta tag and Webmaster tools, that address some of the above problems.

How much PageRank does a page that is not in the index have?

Friday, September 14th, 2007

If a page has a NOINDEX tag on it, how much PageRank does it have? The intuitive answer would be “None”. How can a page that is not indexed have PageRank? Wouldn’t it be treated like a dangling link and disregarded during PageRank calculations?

Apparently not. Matt Cutts states on seomoz:

Does a link from a page with meta robots=”noindex, follow” carry less weight? no weight?

For Google, I believe such links would carry the same weight as normal links on regular pages.

Hmmm. Does he mean that the unindexed page actually has a PageRank? Or does he mean that the zero PageRank that the unindexed page has would be divided out among the links on the page, giving nothing to each? I wonder …
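The two readings can be compared on a toy graph. What follows is a simplified textbook PageRank iteration, certainly not Google's real computation, and the four-page graph (A, B, C and a NOINDEX, FOLLOW page N) is made up. Page C has no inbound links and acts as a control.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over {page: [targets]}."""
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outlinks is spread over every page.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new = {node: (1 - damping) / n + damping * dangling / n
               for node in nodes}
        for source, targets in links.items():
            for target in targets:
                new[target] += damping * rank[source] / len(targets)
        rank = new
    return rank


# Reading 1: N stays in the link graph and passes weight on to B.
web = {"A": ["N"], "N": ["B"], "B": ["A"], "C": ["A"]}
with_n = pagerank(web)

# Reading 2: N is dropped from the graph, along with links to and from it.
web_without_n = {s: [t for t in ts if t != "N"]
                 for s, ts in web.items() if s != "N"}
without_n = pagerank(web_without_n)
```

Under the first reading N has a rank of its own and B ends up ahead of the control page C. Under the second, B is indistinguishable from C. Matt's answer sounds like the first reading.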

One thing’s for sure … if “NOINDEX, FOLLOW” works as implied, it’s a great way to inject spammy content and links.

Google, Paid Links, The FTC and Deceptive Advertising

Friday, September 14th, 2007

Thanks to Dan Thies for drawing my attention to the latest “mayhem” surrounding Google, rel=nofollow and the FTC. This is an area close to my heart, as my article from 2005, Search Marketing & The Law, made clear:

It would be foolish to expect to be operating in a multi-billion dollar global marketing industry and not expect to comply with marketing laws and regulations in the countries in which you are marketing.

The current confusion stems from Matt Cutts’ blog post on paid links back in April, which called for both human readable and machine readable disclosure of paid links - machine readable first:

If you want to sell a link, you should at least provide machine-readable disclosure for paid links by making your link in a way that doesn’t affect search engines. There’s a ton of ways to do that. For example, you could make a paid link go through a redirect where the redirect url is robot’ed out using robots.txt. You could also use the rel=nofollow attribute.

The problem here is that there is no machine-readable disclosure for paid links. Matt suggests that there are a “ton” of ways, but none of these ways mean “this link is paid”, let alone the means, method and motive for payment. This is where the confusion starts.

Matt then goes on to discuss human-readable disclosure:

The other best practice I’d advise is to provide human readable disclosure that a link/review/article is paid.

Here I fully agree with Matt - it’s important not to mislead your visitors. No confusion here.

The real confusion seems to come from the next thing Matt says:

Google’s quality guidelines are more concerned with the machine-readable aspect of disclosing paid links/posts, but the Federal Trade Commission has said that human-readable disclosure is important too:

“The petition to us did raise a question about compliance with the FTC act,” said Mary K. Engle, FTC associate director for advertising practices. “We wanted to make clear . . . if you’re being paid, you should disclose that.”

To make sure that you’re in good shape, go with both human-readable disclosure and machine-readable disclosure, using any of the methods I mentioned above.

Some people have inferred that Matt is saying that paid links that aren’t labelled in a machine-readable way are contravening the FTC guidelines. He isn’t saying this at all. Read carefully. The FTC is concerned with human-readable disclosure, not machine-readable disclosure. There is no machine-readable disclosure for paid links.

It is possible to place deceptive advertising in search results using various means. But failing to label a link as paid in a machine-readable way is not one of them. There is no machine-readable disclosure for paid links.

Google Defines Cloaking - Again!

Friday, June 8th, 2007

I see that Google have added to their Quality Guidelines, including a new, helpful(?) definition of cloaking:

Cloaking refers to the practice of presenting different content or URLs to users and search engines. Serving up different results based on user agent may cause your site to be perceived as deceptive and removed from the Google index.

Some examples of cloaking include:

  • Serving a page of HTML text to search engines, while showing a page of images or Flash to users.
  • Serving different content to search engines than to users.

That’s fairly clear then. :)

My own definition of cloaking is

Cloaking
The identification of a search engine spider by some feature of its IP address or HTTP request, and the resultant delivery of a response to that spider designed to game the search engine’s ranking algorithm.

My rule of thumb is that you should not need to know that a search engine is making the request in order to deliver a response to that request. The obvious exception to this rule of thumb is Paid Inclusion. Paid Inclusion isn’t cloaking. ;)
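The rule of thumb can be turned into a crude check: ask for the same page as a browser and as a spider, and see whether the answers differ. This sketch is an assumption-laden toy. The handler functions and user agent strings are invented, it only covers the HTTP request half of my definition (not IP addresses), and a differing response is a warning sign rather than proof of intent to game rankings.

```python
def response_varies_by_agent(handler, path,
                             agents=("Mozilla/5.0", "Googlebot/2.1")):
    """True if the handler returns different bodies to different agents."""
    bodies = {handler(path, agent) for agent in agents}
    return len(bodies) > 1


def honest_handler(path, user_agent):
    # Never looks at who is asking.
    return "<h1>Widgets</h1><p>Our widget range.</p>"


def cloaking_handler(path, user_agent):
    # Needs to know a search engine is asking: the warning sign.
    if "Googlebot" in user_agent:
        return "<h1>cheap widgets best widgets buy widgets</h1>"
    return "<h1>Widgets</h1><p>Our widget range.</p>"
```

Only cloaking_handler trips the check, because it is the only one that needs to know who is making the request.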

Google’s attitude to paid links - humble or arrogant?

Tuesday, May 15th, 2007

I see Matt Cutts has updated his “How to report paid links” post. I missed this one the first time around as he posted it on my 40th birthday. :)

Matt is head of Google’s Webspam team. We talked together quite a lot in 2001 when Matt was working on the Google Webmaster Guidelines. Matt had seen my talk on spam and cloaking at the 2001 Search Engine Strategies Conference in San Francisco, and had read my White Paper on The Classification of Search Engine Spam. He was particularly keen on my ideas of “Doing it for humans”. These ideas eventually found their way into Google’s Quality Guidelines as

Make pages for users, not for search engines. Don’t deceive your users or present different content to search engines than you display to users, which is commonly referred to as “cloaking.”

Since 2001, Matt and I have communicated regularly and have always seen eye-to-eye on issues of spam. That is, until now. I can’t help but feel that Matt’s stance on paid links and the application of the rel=nofollow attribute is the start of a slippery slope for Google.

Back in 2001, when we agreed on “Doing it for humans”, the principle behind this was that as a Web publisher you shouldn’t need to do anything specifically for search engines in order for them to crawl, index or rank your content. If you found yourself doing something specifically for search engines, that was the time to ask yourself whether you were spamming.

Fast forward to 2005 and Google introduced the rel=nofollow attribute; a contributor to that post was Matt Cutts himself. The best places to use this tag, according to the post, are

the actual links that other people can create [...] for instance, only the links within comments and the link immediately after “Posted by:” would get the rel=”nofollow” attribute.

This seemed like a great idea at the time. With hindsight, it marked the introduction of a tag that was specifically designed to affect search engine ranking algorithms:

At the heart of our [Google's] technology is PageRank™, a system for ranking web pages … [that] relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page’s value.

By asking publishers to label links with rel=nofollow, Google was giving those publishers the power to affect PageRank. Nothing more and nothing less.
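From the search engine's side, the label is trivially easy to read. Here is a sketch, using Python's standard html.parser, of how a crawler might separate labelled links from unlabelled ones. The class name and example URLs are hypothetical, and what an engine then does with the two groups is its own business.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, is_nofollow) for every link on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        a = dict(attrs)
        href = a.get("href")
        if href:
            # rel is a space-separated list of tokens.
            rel_tokens = (a.get("rel") or "").lower().split()
            self.links.append((href, "nofollow" in rel_tokens))


collector = LinkCollector()
collector.feed('<a href="http://a.example/">editorial link</a> '
               '<a rel="nofollow" href="http://b.example/">comment link</a>')
```

collector.links now holds one unlabelled link and one labelled link. The publisher, not the engine, decided which was which.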

Search engines offer tags and other techniques to prevent crawling or indexing of Web pages - notably robots.txt files and the robots meta tag. But not since the ill-conceived and now much deprecated meta keywords tag has a search engine offered the ability to control rankings, in the way that the rel=nofollow attribute does.

So, the problem with rel=nofollow is that it’s “Doing it for search engines”, not the “Doing it for humans” that Matt Cutts and I agreed on back in 2001. Matt covered this in his post:

That same philosophy would mean that you wouldn’t create a robots.txt file (users don’t check those), never make any meta tags (users don’t see meta tags), never create an XML sitemap file (users wouldn’t know about them), and wouldn’t create web pages that validate (users wouldn’t notice). Yet these are all great practices to do.

However, there are two big differences between rel=nofollow and these examples quoted by Matt:

  1. rel=nofollow is not designed to affect crawling, or indexing, which are naturally of interest to Webmasters (as it’s their site and their content); but ranking, which is the preserve of the search engine (as it’s their algorithm).

  2. Failure to know about or deploy robots.txt, meta tags or sitemaps is not search engine spam. However, Matt is saying that failing to use rel=nofollow could be treated as spam; so a Webmaster who is buying links needs to know of the existence of rel=nofollow in order to avoid spamming.

I can see why rel=nofollow was felt necessary, and I agree it’s a good idea in certain circumstances (particularly when built into software such as WordPress, rather than expecting individual publishers to know about it and apply it). What I can’t understand is why Google is now asking publishers to label paid links with rel=nofollow. The labelling of paid links was never mentioned as an application when rel=nofollow was introduced. There are some big problems with this approach:


  1. Even when rel=nofollow was introduced the Web was over 10 years old. What about all the old links, that were made before rel=nofollow existed? Are publishers supposed to go back and change them? All of them?

  2. Since rel=nofollow was introduced, it has become relatively well known in the dedicated search marketing and blogging community. But what efforts have been made to make it known to everyday Web designers and publishers?

  3. What exactly constitutes a paid link? A link to a parent company? A link to a partner company who has supplied you work in the past? What about a link to somebody who bought you a beer at a conference once? Or is it only a link for which money specifically was exchanged? In some commercial areas of the Web, depending on your point of view, all links could be considered paid. Although I don’t work in the pills/porn/casino industries, even categories such as financial services are so highly commercialised that almost all links could be considered as paid. Within such a category, if all links were labelled with rel=nofollow (as Google appears to want), what actual use would rel=nofollow be?

Given the above problems, I can’t quite work out whether Google is being humble or arrogant in asking publishers to label their links…

  • Humble: It’s like Google is admitting that it can no longer detect and properly compensate for paid links algorithmically. The heart of their ranking technology is links, and link spam is hurting so badly that they are asking for help.

  • Arrogant: It’s like Google is starting to lay down how the Web will be built. Rather than reacting to how people build Web sites, they are telling people how to build Web sites.

I am not enamoured with either of these possibilities.

Skip Google Hell, Time For Google Heaven

Tuesday, May 1st, 2007

What a dreadful, poorly researched article in Forbes magazine. There’s so much wrong with it, I can’t find a single good thing to say - so I’ll say nothing more about it. :|

In brief, the way to Google Heaven is:

  1. Create good quality, unique content
  2. Ensure the content can be crawled and indexed by Google
  3. Take steps to ensure that the content is seen once only, at the best URL for it
  4. Build good quality links to the content from your own sites and those of relevant third parties

It’s a shame the sites featured in the Forbes article failed to follow this simple formula.

Dixons Technical Architecture

Tuesday, May 1st, 2007

Reading the Sunday Times this weekend, there was an interesting article on Full HD TVs. The Sharp LC-37XD1E looked good value, so I checked Dixons. They didn’t stock it, but they do stock the 42″ model, the SHARP LC42XD1E Flat Panel TV.

Enough about TVs. See that great link I just gave Dixons? A deep link direct to a product page, labelled with the product text. That link should help Dixons to rank well for the SHARP LC42XD1E Flat Panel TV. Unfortunately for Dixons, it won’t help as much as it could. Just look at the URL of the link:

http://www.dixons.co.uk/martprd/store/dix_page.jsp?
BV_SessionID=@@@@2111405657.1178042616@@@@&
BV_EngineID=ccckaddkkmmglhhcflgceggdhhmdgml.0&
page=Product&fm=null&sm=null&tm=null&sku=317042&
category_oid=-28723

That’s a bad URL. It’s encoded with Session IDs, Engine IDs and null parameters. That’s not the sort of link a search engine would like to crawl, and even if a search engine did manage to crawl and index the content at that URL, it’s unlikely that a searcher visiting the URL several weeks later, as a result of searching for a SHARP LC42XD1E Flat Panel TV, would see any content. The session ID would be long expired. This URL is produced by BroadVision, an e-commerce application used by Dixons.
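To illustrate what a cleaner link might look like, here is a sketch that strips the session-specific and null parameters using Python's standard urllib.parse. The list of parameters to drop is my own guess from looking at this one URL, and I have not checked whether the Dixons server would actually serve the cleaned URL; the point is only to show which parts carry information.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters assumed (from this one URL) to be per-visit state.
SESSION_PARAMS = {"BV_SessionID", "BV_EngineID"}


def clean_url(url):
    """Drop session parameters and parameters whose value is 'null'."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in SESSION_PARAMS and value != "null"]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))


dixons_url = (
    "http://www.dixons.co.uk/martprd/store/dix_page.jsp?"
    "BV_SessionID=@@@@2111405657.1178042616@@@@&"
    "BV_EngineID=ccckaddkkmmglhhcflgceggdhhmdgml.0&"
    "page=Product&fm=null&sm=null&tm=null&sku=317042&"
    "category_oid=-28723"
)
cleaned = clean_url(dixons_url)
```

What survives is the page type, the SKU and the category: three parameters out of eight.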

Moral: expensive, high end Web applications don’t necessarily produce marketable, search-friendly sites.

Bringing Down Google With Two Simple Lines of Code

Friday, April 27th, 2007

Is Google too powerful? It’s a question asked by many. But much of Google’s future depends on two simple lines of code.

When Google floated, its SEC Filing listed many potential future threats to its business:

  • Our ability to compete effectively.
  • Our ability to continue to attract users to our web sites.
  • The level of use of the Internet to find information.
  • Our ability to attract advertisers to our AdWords program.
  • Our ability to attract web sites to our AdSense program.
  • The mix in our net revenues between those generated on our web sites and those generated through our Google Network.
  • The amount and timing of operating costs and capital expenditures related to the maintenance and expansion of our businesses, operations and infrastructure.
  • Our focus on long term goals over short-term results.
  • The results of our investments in risky projects.
  • General economic conditions and those economic conditions specific to the Internet and Internet advertising.
  • Our ability to keep our web sites operational at a reasonable cost and without service interruptions.
  • The success of our geographical and product expansion.
  • Our ability to attract, motivate and retain top-quality employees.
  • Foreign, federal, state or local government regulation that could impede our ability to post ads for various industries.
  • Our ability to upgrade and develop our systems, infrastructure and products.
  • New technologies or services that block the ads we deliver and user adoption of these technologies.
  • The costs and results of litigation that we face.
  • Our ability to protect our intellectual property rights.
  • Our ability to forecast revenue from agreements under which we guarantee minimum payments.
  • Our ability to manage click-through fraud and other activities that violate our terms of services.
  • Our ability to successfully integrate and manage our acquisitions.
  • Geopolitical events such as war, threat of war or terrorist actions.

One thing they never mentioned (explicitly anyway) was “Site owners continue to give us permission to crawl and index their sites”. Without that permission, a large part of Google’s business model disappears.

The permission can be taken away with two simple lines of code placed in a site’s robots.txt file:

User-agent: Googlebot
Disallow: /

Sure, every site owner in the world would need to publish this file to their sites. But if they did such a thing, the Google search engine could no longer crawl or index any of the Web’s content. It would be defunct.
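You can check what those two lines do with Python's standard urllib.robotparser. In this sketch the example.com URL and the SomeOtherBot name are placeholders; the parser simply reports how a robot that obeys the protocol should read the file.

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.parse(["User-agent: Googlebot", "Disallow: /"])

url = "http://www.example.com/any/page.html"
googlebot_allowed = robots.can_fetch("Googlebot", url)
other_allowed = robots.can_fetch("SomeOtherBot", url)
```

googlebot_allowed is False and other_allowed is True: the file shuts out Googlebot alone and leaves every other robot free to crawl.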

So, fellow site owners, Google’s future is in our hands. If you want to go “on strike” and stop Google profiting from the fruits of your labours, simply publish the code. Be warned that your site will eventually be removed from Google’s index if you do so. As a unilateral step, this may do you more harm than good. But if we all do it en masse, then beware Google!