14
Google Adwords trademark policies: what’s the billion dollar question?
0 Comments | Posted by Alan Perkins in Google, PPC
The Google Adwords trademark policy is aimed at balancing the interests of trademark holders, advertisers and internet users. Does it always do this, or can it sometimes provide a method for trademark holders to restrict competition and potentially cause harm to advertisers, internet users and Google itself? That’s the billion dollar question.
Google Adwords is a paid search marketing program offered by Google that allows millions of organisations around the world to advertise their products and services in Google’s search results. That makes Adwords a big deal. Adwords accounts for the lion’s share of Google’s revenues, which totalled $16.5bn in 2007 alone.
Yahoo! and Microsoft offer similar programs to Adwords. However, Google is the market leader, with estimates of its paid search market share ranging from 58% upwards. Google clearly holds a dominant position within the paid search marketplace, so its policy decisions matter.
Google’s dominance has created a significant demand within Adwords from third party advertisers who would like to market products and services against the search results for popular trademarks which they do not own. As a result, there have been several instances where Google has faced legal action by trademark holders trying to restrict third parties bidding on search terms relating to their trademarks. Trademark holders in the US, such as Geico and American Airlines, have previously filed suit. In Europe, Google has been sued by the likes of Louis Vuitton in France.
These legal actions led to the introduction of the Google Adwords Trademark Policy. There are in fact two policies, one or other of which is in force in any location around the world. These policies allow the trademark holder to exert significant influence over the use of their marks within the Adwords program.
Whilst it may seem a reasonable response on the part of Google to seek to recognise and protect the rights of trademark owners, especially in response to suggestions Google may be facilitating passing off and/or infringement of registered trademarks, the problem is that the Google Adwords Trademark Policy may in fact give far more power to trademark holders than they need to protect their goodwill and prevent passing off. Google’s trademark policies may fail to recognise the legitimate right of third parties to use registered trademarks which they do not own in order to legally sell products and services which they have a right to sell, and may thereby enable trademark holders to restrict free trade in goods and services.
For example, in the motor market, many private individuals, non-franchise and franchise dealers have a legitimate right to use manufacturer and model trademarks in order to describe a car or range of cars they wish to advertise.
An example would be if you wished to sell your Peugeot 308. Do you really want to have to call it a mid size French 1.9 litre diesel hatchback? Somehow the sale is much more likely to happen if you just call it by its make and model rather than a bland description.
Clearly in this example there is no passing off and no loss of goodwill. It is completely understood by all parties that the advertiser of the car is not necessarily the trademark holder. Yet Google’s trademark policies mean that advertisers can be prevented from using trademarked terms even so. Has this policy really balanced the interests of trademark holders, advertisers and internet users, as Google purports to do? Commenting, Kevin McGuinness of London-based commercial law specialists Sabretooth Law stated:
In restricting the use of trademarks Google may have diminished the ability of non-owners of trademarks to legitimately use such trademarks in the course of carrying on their trade. Given the size of the market in which Google operates and the importance of the advertising market to automobile resale sector this is likely to be an area where both English and European competition authorities may take an interest in arrangements which potentially restrict competition to the detriment of the general public.
Antitrust or anti-competition issues have been one area where both the UK and European competition authorities have consistently demonstrated a keen interest in protecting the European consumer and Google’s dominant position in the paid search marketing sector would suggest it needs to ensure its policies are legal, not only in the US but also in Europe.
In the UK, an organisation can be fined 10% of its worldwide annual revenues for engaging in anti-competitive behaviour. As noted earlier, these amounted to $16.5bn for Google in 2007 alone, so 10% would be $1.65bn. That is a large number!
Is it Google’s responsibility, though, or is it the responsibility of the respective trademark holders? Or is it both?
It seems harsh to hold Google solely responsible, when Google has been simply trying to respect trademark holders’ legitimate rights; especially in light of the fact that Google has been sued by several trademark holders and to some extent its trademark policy is a result of that. In addition, by restricting competition on some trademarked terms, Google may have impacted its own revenues. Kevin McGuinness again:
As Google is the participant in the on-line market place, which is itself restricting the availability for use of other persons’ trademarks, it could be that Google, not the trademark holders, may be found to be at fault. This hardly seems fair given Google’s long standing commitment to ethical good business practice.
Clearly Google does not exercise its trademark policy in isolation. Only when a trademark holder files a trademark complaint in the appropriate jurisdiction does Google exercise its policy. This is why you can see Google Adwords for lots of trademarked terms, but not all.
Evidence of how trademark holders are working with search engines came in a recent interview with New Media Age magazine (subscription required) when Steve Bowler, Marketing Manager of Land Rover, stated:
One of the areas that wasn’t looked at properly before was search. Previously it was recognised as being somewhat important yet ancillary to TV, press and outdoor. Now, though, we take search very seriously, working with the search engines on how to deal with issues like trademarking.
As a result, Kevin McGuinness states:
Competition authorities could conclude that Google and trademark holders are each using Google Adwords to prevent competition.
Not only Google but each individual trademark holder could be investigated and potentially fined up to 10% of global revenues. Trademark holders who have restricted their trademarks include Alfa Romeo, Peugeot and Land Rover.
Do the same issues also affect Yahoo! and Microsoft? No. Both of these search companies have much more targeted trademark policies. For example, Yahoo!’s policy is:
As applied to nominative uses of another’s trademark, Yahoo! Search Marketing requires advertisers to meet one of the following two conditions: … Reseller [... or ...] Information Site, Not Competitive
And Microsoft’s policy, though targeted, is elegantly simple:
Affiliates and resellers may bid on trademarked terms relevant to the goods, services, or sites that they promote.
Why does Google not have such a simple policy? Perhaps because, though simply stated, the Yahoo! and Microsoft policies require more editorial intervention than the Google policy, or perhaps because Google’s current policy arises from being sued by trademark holders, rather than being pursued by competition authorities. Google’s official response is posted on their Inside Adwords Blog:
We will not allow the use of a trademark term according to the parameters of the trademark complaint filed by the trademark owner. Therefore, unless the trademark owner specifically grants you permission to use their trademarked term by contacting our Trademark team, we are not able to approve the use of the trademark in your AdWords ads.
There is no explanation there, nor has one ever been offered on the many occasions Google has been given to comment on this issue, but one can only assume that Google believes it is on solid legal ground in operating this policy. The question is: are they correct?
Though a vast improvement on Google’s trademark policy, Yahoo!’s and Microsoft’s policies both restrict comparative advertising (advertising which “explicitly or by implication, identifies a competitor or goods or services offered by a competitor”). A recent European court case showed that such restrictions may be unlawful. However, that is a different, and far less contentious, issue than the anti-competition issues raised by the Google Adwords Trademark Policy alone.
So, the question remains. Have Google and/or its advertisers been in contravention of UK or EU competition laws in exercising Google’s trademark policy to date? Microsoft’s European court experience should provide ample evidence that American software giants need to be very careful within the European Union. Once the EU competition authorities decide to bite, they rarely let go of their prey quickly. Given the enmity between the two, will Microsoft be at the head of the line to point out the ongoing competition issues in Google’s trademark policies?
Google has, since its inception, been a beacon of best business practice, but it may be on the wrong side of this legal issue by trying to do the right thing by trademark holders who continue to abuse its policies in order to restrict fair competition. With fines of up to 10% of global turnover possible, it’s a high stakes issue.
SilverDisc is 15 years old today!
SilverDisc was established on March 10, 1993 by three people (Alan Perkins, Allan Todd and Eric Barfield) who had met while working on interactive CD applications at Philips in Surrey. That is why the company is called SilverDisc – because its first products and services were delivered on CD, a “silver disc”.
Alan, Allan and Eric were three guys who enjoyed making a living while having fun doing techie things. Well, they considered it fun anyway! In 1993 they were using very low speed dial-up modems and Compuserve to communicate with each other and a small world of CD developers. In 1994 they got their first Web server and started hosting Web sites for clients. In 1995 they developed a fully functional shop with online credit card capabilities for one client, and started hosting the Web site for HarperCollins, a major publisher – not bad going for three guys working from home and having fun.
Early Years of Search Marketing
In late 1995, on the day AltaVista launched, SilverDisc realised the potential that search engines held for marketing purposes. They started marketing themselves and their clients through search engine optimization, although that phrase was not in use at the time.
During the mid-to-late nineties SilverDisc continued to deliver CD and DVD products and services as well as Internet services and Internet marketing. It remained three guys having fun. Then, a few things happened in quick succession:
- Allan moved back to Scotland and got a real job working for somebody else – he wanted his young family to get a “proper Scottish education”
- Alan moved back to Northamptonshire, mainly to tap into a support network for his young family, but remained with SilverDisc
- Alan and Eric teamed up with a distant relative of Eric’s and formed a new company, e-Brand Management, to take advantage of the “dot com boom”
1998 to 2000 was spent developing patents, products and services based around some ideas that Alan had in the years since 1995. In parallel, SilverDisc continued to service its existing client base.
The patents were filed in 1999 and have since been granted. They cover some very fundamental search engine ground. One patent is in crawling and indexing, and the other is in personalisation – both are hot topics today, nine years later. The first product, Search Mechanics, was launched at the very first Search Engine Strategies to be held in the UK and e-Brand Management was one of only five exhibitors there.
That covers “SilverDisc – the early years”. In a future post, I’ll look at what has happened since 2000.
29
When should “NOINDEX” mean “INDEX”?
0 Comments | Posted by Alan Perkins in Crawling and Indexing
Matt Cutts has stirred up a little hornets’ nest with his “What should NOINDEX do?” post. Matt reckons the topic will be colossally boring to some people – but not to me. For some reason I find Robots standards fascinating. Yep, I know I’m weird.
The crux of Matt’s issue is …
The question is whether Google should completely drop a NOINDEX’ed page from our search results vs. show a reference to the page, or something in between?
The obvious response is to completely drop the NOINDEX’ed page. NOINDEX is made up of the two words NO and INDEX; so it means do not index, right?
Maybe not. It’s important to be precise here. What exactly does NOINDEX mean?
Often when talking about indexing issues, it’s useful to separate in your mind the indexing of a URL from the indexing of the content at that URL. This concept is particularly important in the contexts of URL canonicalization, duplicate content and … robots standards. I’ll restrict this discussion to the NOINDEX part of the robots standards, but an equally interesting discussion exists around robots.txt too.
Once we separate URL and content, the question “What exactly does NOINDEX mean?” can be answered in several ways:
1) Index the URL but not the content
2) Don’t index the URL or the content
3) (Somehow, not sure how!) index the content but not the URL
One thing is for sure … it does not mean index both the content and the URL.
In my opinion NOINDEX should definitely mean “Don’t index the content”. Definitely. No question.
The question of whether it should mean “Don’t index the URL” is an interesting one. There are arguments both ways. In my experience, however, there are many, many different examples of when it should mean “Don’t index the URL”. In these instances, if the URL was indexed, it would result in something bad happening either for searchers, or the site owner, or both. Therefore, generally, I think it should mean “Don’t index the URL”.
However, there is one specific case where I think it would be acceptable to index the URL, and which would give benefit to both searchers and site owners (very often). That specific case is when the URL is the home page of the site.
Taking the three “problem” URLs cited by Matt in his post:
If high-profile sites like
- http://www.police.go.kr/main/index.do (the National Police Agency of Korea)
- http://www.nmc.go.kr/ (the National Medical Center of Korea)
- http://www.yonsei.ac.kr/ (Yonsei University)
aren’t showing up in Google because of the NOINDEX meta tag, that’s bad for users
These three URLs are all actually home pages. The second and third URLs are obviously so. The first URL is the result of a couple of 302 redirects:
- http://www.police.go.kr/ is a 302 to http://www.police.go.kr/index.jsp
- http://www.police.go.kr/index.jsp is a 302 to http://www.police.go.kr/main/index.do
This makes http://www.police.go.kr/main/index.do the home page of the site. The way Google works (correctly IMO) is that a redirect from “/” to a deeper page on a site would normally result in the content of that deeper URL being indexed under “/”.
So, I think a reasonable middle ground, that satisfies the best interests of searchers, site owners and search engine, would be the following:
- Do not index the content.
- Do not link to the URL in the search results, unless the URL is a “home page” (/, or redirected to by /).
- If it is a home page with a NOINDEX tag, it’s OK to link to it in the SERPs, but do not index the content; do not provide a snippet; and do not provide a cached copy. Treat it like a “partially indexed page”.
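The middle ground proposed above can be sketched as a small decision function. This is only an illustration of the rule as I have stated it, not a description of how Google or any other engine works; the names Listing, is_home_page and listing_for, and the root_redirect_target parameter, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class Listing:
    show_link: bool      # may the URL appear in the search results?
    index_content: bool  # may the page text be searchable?
    show_snippet: bool   # may a snippet be shown?
    show_cache: bool     # may a cached copy be offered?


def is_home_page(url: str, root_redirect_target: Optional[str] = None) -> bool:
    """A URL counts as a home page if its path is "/" or if "/" redirects to it."""
    path = urlsplit(url).path or "/"
    return path == "/" or url == root_redirect_target


def listing_for(url: str, noindex: bool,
                root_redirect_target: Optional[str] = None) -> Listing:
    if not noindex:
        # No NOINDEX tag: index and present the page normally.
        return Listing(True, True, True, True)
    if is_home_page(url, root_redirect_target):
        # "Partially indexed page": link only, no content, snippet or cache.
        return Listing(True, False, False, False)
    # Any other NOINDEX page: neither the URL nor the content is listed.
    return Listing(False, False, False, False)
```

For the police.go.kr example, root_redirect_target would be http://www.police.go.kr/main/index.do (the end of the redirect chain from "/"), so that URL would be linked to but its content would not be indexed.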
7
SilverDisc Gives Zach’s Helping Hand a Helping Hand
0 Comments | Posted by Alan Perkins in General
Charity Christmas cards are commendable, but still a large part of the cost of buying and posting charity cards does not end up with the charity itself.
So this year, SilverDisc has chosen to donate its entire Christmas card budget of £250 to local charity Zach’s Helping Hand, and instead send an electronic Christmas card.
Zach’s Helping Hand is used by families with children near to the end of life to receive palliative care within the love of their own homes. It is dedicated to the memory of Zach Sanders, who died of a brain tumour aged just two, and was set up by Zach’s parents Andy and Claire.
The photo shows me, Andy, Bella (Zach’s sister) and Lynda Litchfield of SilverDisc.

29
SilverDisc supports Wallace and Gromit’s Children’s Foundation
0 Comments | Posted by Alan Perkins in Austin A35 Van
I’m happy to report that SilverDisc won the bidding for Nick Park’s Austin A35 van on eBay on Saturday, with all proceeds being donated to the Wallace and Gromit’s Children’s Foundation. According to Nick Park, Oscar-winning creator of Wallace and Gromit:
I’ve always been a fan of Austins and this particular vehicle inspired me to come up with the Anti-Pesto van that was central to the plot and rehabilitation of the vegetable eating ‘pests’ in Curse of the Were-Rabbit. The van needed to be big enough to transport Wallace’s invention the Bunvac 2000 while at the same time slick enough to go on high speed chases after the formidable Were-Rabbit, and the Austin was a perfect match.
We’re very happy to be supporting the Wallace and Gromit’s Children’s Foundation with our purchase of this inspirational piece of movie memorabilia. We have some big plans for the van, which we’ll let you know about over the coming days and weeks on this blog.
19
Robogenic – the one word expression for “search engine friendly”
0 Comments | Posted by Alan Perkins in SEO
Poor old Ming Campbell. Literally, “old” Ming Campbell has resigned/been ousted from the leadership of the Liberal Democrats, the UK’s third political party, because he is too old at 66. Supposedly, in this News 24 society, you need to be young, dynamic and good looking in order to attract votes and one of his likely successors is said to be telegenic enough to fit the bill.
Telegenic. What a horrible word, a real cut and shut job (photogenic and television, but the result should mean “produced at a distance”, not “looks good on television”). But it got me thinking … the phrase “search engine friendly” has always seemed so clumsy. So what about “robogenic” as a one-word equivalent, meaning “search engine friendly”, or “looks good to a robot”.
- robogenic
- search engine friendly; looks good to a robot
I like it. Unfortunately, robogenic should literally mean “Produced by a robot”, in the same way as photogenic literally means “Produced by light”. Ah, so what? I still like it.
OK, sorry for the slightly misleading headline (although if you read on you’ll find it’s not that misleading). No apologies, though, for giving my opinion on what is now old news, which is that Google has dropped Best Practice Funding for agencies from 2009 onwards. Don’t ever expect this blog to be first with the news … there are others in the industry who are devoted to that. What you can expect here is considered, truthful opinion and, hopefully, an insight that you won’t find anywhere else.
There’s plenty of comment around about the fact that BPF was not a subsidy, was not a commission and was not, in fact, related to any individual advertiser but rather to the net billings of the whole agency. Personally, I think it’s great the playing field is levelled, but I’m still not looking forward to having to renegotiate rates with clients. Any agency that doesn’t have to renegotiate was either not receiving BPF or was charging too much in the first place, and SilverDisc does not fit either of those two categories, I’m happy to say.
What’s missing is comment on what BPF actually is, and what its withdrawal therefore signifies.
Probably the best document that describes what Best Practice Funding is, if you’re prepared to read between the lines, is the 2007 – Best Practice Funding Terms And Conditions. This lists several conditions that an agency must meet in order to fully qualify for BPF. Those conditions include:
- the fact that the agency, rather than the agency’s customer, must communicate with Google
- the fact that the agency is responsible for Google being paid its invoices on time
I can’t help feeling that Google is massively undervaluing the role of agencies in providing these services. Their support role, in both account management and invoicing, will grow enormously in 2009. I hope that Google uses the time between now and then to grow its infrastructure accordingly.
Another requirement on agencies to qualify for BPF is that they employ at least two GAP-qualified staff. This is where my slightly misleading headline actually has a ring of truth. The GAP exam has been the best tool for building and maintaining an understanding of Adwords. I’ve passed it myself and, before Christmas, I’m due to renew my qualification. All my PPC management staff and PPC programming staff (we write PPC API apps to manage our clients’ spends) have passed the GAP exam too and, again, are due to renew before Christmas.
I always thought that Google’s encouragement of agency staff being GAP-qualified was of great benefit to Google, the agencies, and the industry as a whole. In dropping BPF, I think Google are sending a poor message – in, literally, stopping funding best practices, they are stopping supporting best practices.
5
Can’t Google Write Any Decent Analytics Documentation Themselves?
0 Comments | Posted by Alan Perkins in Analytics & Log Files, Google
Rarely does Google give such a public ringing endorsement for a third party as this, on the official Google Analytics blog:
Can I look forward to a link drop to SilverDisc here or here?
18
The robots.txt file and the robots meta tag
0 Comments | Posted by Alan Perkins in Crawling and Indexing, SEO, robots.txt
This article was first written and published by me in 2000/2001, but no longer exists on the Web. It’s still accurate – although search engines (notably Google) have taken steps to correct some of the problems described below, they can and do still arise.
There are two common protocols for the prevention of indexing of Web resources:
- The robots.txt protocol
- The robots meta tag protocol
This article describes:
- The theory and practice of these two protocols
- Anomalies and inadequacies in the protocols
The robots.txt protocol
A search engine spider is a Web robot and, as such, may choose to obey the robots.txt protocol. The robots.txt protocol was invented in 1994 and has remained as the de facto standard for controlling robots’ access to a Web site. Most search engines claim to support it, but no robot, including a search engine spider, has to support it.
The protocol is described in the document “A Standard for Robot Exclusion”. That is the page that most search engines that support the robots.txt protocol will refer you to if you require more details. However, if you read that page, you will see that it contains no reference to search engines at all. The introduction to the page says:
In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren’t welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren’t suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).
These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.
So, the purpose of the robots.txt protocol is to provide a mechanism for WWW servers to indicate to robots which parts of their server should not be accessed, i.e. to prevent robots from reading parts of their server. How does this purpose relate to preventing a search engine from indexing a particular resource? Unfortunately, the general answer to this question is “It doesn’t”.
The Disallow line in a robots.txt file means “disallow reading”, but that does NOT mean “disallow indexing”. In other words a disallowed resource may be listed in a search engine’s index, even if the search engine obeys the protocol. The most obvious demonstration of this is Google. Google can add files to its index without reading them, merely by considering links to those files. In theory, Google can build an index of an entire Web site without ever visiting that site or ever retrieving its robots.txt file. In so doing it is not breaking the robots.txt protocol because it is not reading any disallowed resources, it is simply reading other web sites’ links to those resources.
The Disallow line in a robots.txt file means “Disallow reading”, it does not mean “Disallow indexing”. A resource does not necessarily need to be read in order to be indexed.
Let’s return to the question of how the robots.txt file can be used to prevent a search engine from listing a particular resource in its index. In practice most search engines have placed their own interpretation on the robots.txt file which allows it to be used to prevent them adding resources to their index, as follows. Most search engines interpret a resource being disallowed by the robots.txt file as meaning they should not add it to their index, and if it is already in their index (placed there by previous spidering activity) they remove it. This last point is important, and an example will illustrate the point.
A particular resource may have been published to a particular Web site on 1st January 2000. That resource may have been indexed by a search engine on 1st February 2000. On 1st March 2000, the site owner may have modified the site’s robots.txt file to disallow the resource from being read by the search engine spider. On 1st April 2000, the search engine spider may re-visit the Web site and note the new entry in the robots.txt file. The search engine spider may now simply choose not to read the resource but to leave the copy of the resource in its index unchanged, and this would not be breaking the robots.txt protocol. But most search engine spiders will both:
- not read the resource and
- remove the resource from their index.
In this example, note that throughout March the resource was in the search engine’s index even though it was disallowed by the robots.txt file.
In practice, most search engines interpret a Disallow line as meaning “Do not index this resource and, if you already have an index of this resource, remove it”. It may take some time from the point a resource is Disallowed to the point that resource is removed from a particular search engine’s index. If you want to ensure a particular resource is never indexed, ensure it is prevented from being indexed by a Disallow line in the robots.txt file before publishing the resource for the first time.
Now let’s consider how the robots.txt protocol can be used to prevent binary resources, such as images (e.g. GIF files), from being added to a search engine’s index. Let’s suppose a particular Web site put all its images in a directory called /images, and had the following robots.txt file:
User-agent: *
Disallow: /images/
You might think that this would prevent the site’s images being indexed by image search engines. But think again about what we have learned about the robots.txt file. It prevents Web robots, including search engine spiders, from reading a resource. But search engines do not need to read an image before adding it to their index. Many spiders just read the ALT text of the IMG tags that refer to the image, rather than reading the image itself. Since the spiders are not reading the image, they are not in breach of the robots.txt protocol if they index the image. This scenario is analogous to Google building an index of a resource without reading that resource: an image search engine can build an index of an image without reading an image.
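The distinction can be seen in a short sketch. It uses Python's standard urllib.robotparser and html.parser modules; the robot name ExampleBot, the sample page and the AltTextCollector class are assumptions made up for illustration. The Disallow line stops the image being read, yet an index entry for it can still be built from the ALT text alone.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
"""

# Parse the robots.txt shown above without fetching anything from the network.
robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

image_readable = robots.can_fetch(
    "ExampleBot", "http://www.example-one.com/images/example.gif")
page_readable = robots.can_fetch(
    "ExampleBot", "http://www.example-one.com/index.html")


class AltTextCollector(HTMLParser):
    """Collects image URL -> ALT text from a page, never touching the image."""

    def __init__(self):
        super().__init__()
        self.images = {}

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            if attributes.get("src"):
                self.images[attributes["src"]] = attributes.get("alt") or ""


# A hypothetical page that the spider IS allowed to read.
PAGE = '<p><img src="/images/example.gif" alt="Example logo"></p>'
collector = AltTextCollector()
collector.feed(PAGE)

print(image_readable)    # the image itself may not be read
print(page_readable)     # the page that embeds it may be read
print(collector.images)  # yet the image can still be described in an index
```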
Once again, in practice most image search engines interpret a Disallow line referring to an image as meaning “Do not index this image and, if you already have an index of this image, remove it”. It may take some time from the point an image is Disallowed to the point that image is removed from a particular image search engine’s index.
Finally, a question that exposes the worst flaw of the robots.txt protocol: a webmaster wishes to make all pages of a Web site, EXCEPT the home page (i.e. “/”), accessible to robots; how can she do this using the robots.txt protocol? The answer – “She can’t”.
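The reason is that a Disallow value is matched as a path prefix, and every path begins with “/”. A minimal sketch using Python's standard urllib.robotparser, which applies this prefix matching, shows the obvious attempt failing; the robot name ExampleBot, the host www.example.com and the sample paths are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def readable(robots_txt: str, path: str) -> bool:
    """Return True if a robot called ExampleBot may read the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", "http://www.example.com" + path)


# The obvious attempt at "disallow only the home page".
ONLY_HOME_ATTEMPT = "User-agent: *\nDisallow: /\n"

results = {
    path: readable(ONLY_HOME_ATTEMPT, path)
    for path in ("/", "/about.html", "/products/widget.html")
}
print(results)  # every path is disallowed, not just "/"
```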
The robots meta tag protocol
The robots meta tag protocol was invented after the robots.txt protocol. It was originally designed to allow HTML developers that did not have permission to write the robots.txt file to the root of a server to have control over the indexing of Web pages. Unlike the robots.txt protocol, the robots meta tag protocol:
- specifically states whether a resource may or may not be indexed
- can help, but cannot prevent, a particular resource from being read
- does not allow large-scale (wildcard) prevention of indexing
- cannot be used to prevent anything except HTML files from being indexed, since the meta tag can only be placed in HTML files (if following the strict definition of the protocol)
Note in particular the second point: the robots meta tag protocol cannot prevent a particular resource from being read because a resource must be read in order to obtain the tag it contains. You may think that if every document that linked to a particular resource contained a robots meta tag NOFOLLOW attribute, that resource could never be read – but what if a new document is added to anywhere on the Web, and that document links to the resource? Or what if somebody submits the resource directly to the Add URL page of a search engine? In both these cases, a search engine will read the resource before discovering the robots meta tag. So the problems the robots.txt protocol was designed to fix – e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting) – are not addressed by the robots meta tag protocol. In other words, there is no “NOREAD” attribute!
So, we’ve said what the robots meta tag is not, but what is it? The robots meta tag is included in an HTML file and defines separately whether the file may be indexed (using the INDEX attribute) or spidered (using the FOLLOW attribute). However, the robots meta tag enjoys less support than the robots.txt file. It is unclear how much of the standard search engines support. Would every search engine, for example, correctly interpret a “noindex, follow” set of attributes?
Since the robots meta tag can only be used within an HTML file, and the NOINDEX attribute only refers to the file that contains it, it cannot be used to prevent binary resources (such as images) from being indexed. Some search engines have invented extensions to the protocol to overcome this problem, but the extensions are not part of the protocol. For example, AltaVista has invented its own robots meta tag attribute (NOIMAGEINDEX) to prevent images from being indexed.
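To show that indexing and following are separate decisions, here is a minimal sketch of how a spider might read the tag, using Python's standard html.parser. The class RobotsMetaParser and the function robots_policy are hypothetical names, and the sketch assumes that anything not explicitly forbidden is allowed. Note that the page has to be read before the tag can be found.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Finds a robots meta tag and records its comma-separated attributes."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives = {
                item.strip().lower() for item in content.split(",") if item.strip()
            }


def robots_policy(html: str) -> dict:
    """Return separate index/follow decisions for one HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        "index": "noindex" not in parser.directives,
        "follow": "nofollow" not in parser.directives,
    }


print(robots_policy('<META NAME="robots" CONTENT="noindex,follow">'))
print(robots_policy("<html><head><title>No tag here</title></head></html>"))
```

A spider built this way would treat “noindex, follow” as “do not index this file, but do spider its links”, which is exactly the combination whose support is in doubt.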
The behaviour of these extension tags is not well defined. An example will illustrate the main problem:
- a particular Web site, let’s call it www.example-one.com, consists of 10 pages
- each of the 10 pages includes an image at www.example-one.com/images/example.gif
- nine of the ten pages contain a robots meta tag like this: <META NAME="robots" CONTENT="index,follow,noimageindex">
- however, www.example-one.com’s home page contains the following robots meta tag: <META NAME="robots" CONTENT="index,follow">
The “noimageindex” attribute is only understood by AltaVista’s image spider. So, when AltaVista’s image spider reads the site, will it add example.gif to AltaVista’s image index? The answer to this question is undefined – nine out of ten pages say it’s not OK to index the image, but one out of ten pages says (implicitly) that it is OK. So the image spider might, or might not, index the image. It all depends on the order the spider reads the pages, the number of pages read by the spider (it might only read the home page), and a multitude of other factors.
To make matters worse, now suppose that there is another Web site called www.example-two.com, every page of which also includes www.example-one.com/images/example.gif. None of the pages on www.example-two.com include a robots meta tag. Would an image spider add example.gif to its index now? Again, the answer to this question is undefined.
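One way to see why the answer is undefined is to write down some rules an image spider could plausibly apply and compare their verdicts. The three rules below are invented purely for illustration; I am not claiming that AltaVista or any other engine uses any of them.

```python
# Each page is (name, has_noimageindex). The home page lacks the attribute;
# the other nine pages carry it, as in the www.example-one.com example.
pages = [("home", False)] + [(f"page{i}", True) for i in range(1, 10)]


def first_page_wins(crawl_order):
    """Hypothetical rule: the first page read decides."""
    return not crawl_order[0][1]


def any_page_permits(crawl_order):
    """Hypothetical rule: index if at least one page does not object."""
    return any(not noimageindex for _, noimageindex in crawl_order)


def every_page_must_permit(crawl_order):
    """Hypothetical rule: index only if no page objects."""
    return all(not noimageindex for _, noimageindex in crawl_order)


home_first = pages
home_last = list(reversed(pages))

print(first_page_wins(home_first), first_page_wins(home_last))
print(any_page_permits(home_first), every_page_must_permit(home_first))
```

The same site produces different outcomes depending on the rule chosen and on the crawl order, and nothing in the protocol says which is right.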
Now a question to test the theory so far … A site owner attempts to exclude a page from being indexed by search engines by both adding a Disallow line in the site robots.txt file and a meta robots tag with noindex attribute into the page itself, before publishing the resource for the first time. Is there any way that a search engine that obeys the robots.txt protocol and the robots meta tag meticulously can have a reference to the resource in its index?
Let’s work this through.
- Suppose the resource is called noindex.htm and it contains the following robots meta tag: <META NAME="robots" CONTENT="noindex,nofollow">
- The URL http://www.example-three.com/robots.txt is then created as follows:
User-agent: *
Disallow: /noindex.htm
- noindex.htm is then published to www.example-three.com/noindex.htm for the first time.
Surely noindex.htm can’t possibly be indexed by a search engine that obeys the robots.txt protocol and the robots meta tag protocol? Can it? It can. In fact, only a search engine that completely obeys both standards can index it. Here’s how.
Our very obedient search engine works a little like Google. So, while its spider is spidering the Web, it finds references to noindex.htm. Each time it finds a reference, the spider creates a better picture of noindex.htm in its index, without ever reading noindex.htm. Sooner or later, the spider visits www.example-three.com. The first thing it does is read robots.txt to find pages it is not allowed to read. The only page it is not allowed to read is noindex.htm, so it doesn’t read that page. It doesn’t remove the page from its index, because, strictly speaking, that is not what the robots.txt protocol means. Because the spider cannot read noindex.htm, it cannot find the robots tag on that page preventing it from indexing that page. Therefore, the page remains in the search engine’s index.
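The scenario can be simulated in a few lines. This is a toy model, not how any real search engine is implemented: the robot name ObedientBot, the linking pages and their anchor text are all made up, and the “index” is just a dictionary of anchor text keyed by URL. The robots.txt handling uses Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = "User-agent: *\nDisallow: /noindex.htm\n"
TARGET = "http://www.example-three.com/noindex.htm"

# Links found while spidering other sites: (linking page, target, anchor text).
# Hypothetical data for illustration only.
LINKS = [
    ("http://www.example-one.com/", TARGET, "secret page"),
    ("http://www.example-two.com/links.htm", TARGET, "do not index this"),
]


def crawl():
    index = {}       # URL -> anchor texts pointing at it
    pages_read = []  # URLs whose content the spider actually fetched

    # Step 1: build a picture of the target purely from links elsewhere.
    for _source, target, anchor in LINKS:
        index.setdefault(target, []).append(anchor)

    # Step 2: visit the site and obey robots.txt to the letter.
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT.splitlines())
    if robots.can_fetch("ObedientBot", TARGET):
        # Only here could the noindex meta tag be discovered and honoured.
        pages_read.append(TARGET)

    return index, pages_read


index, pages_read = crawl()
print(TARGET in index)  # the URL is in the index
print(pages_read)       # but the page was never read
```

Because the page is never read, the noindex tag inside it is never seen, and the entry built from links stays in the index.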
Future posts will address the new features in robots.txt, the robots meta tag and Webmaster tools, that address some of the above problems.
14
How much PageRank does a page that is not in the index have?
0 Comments | Posted by Alan Perkins in Crawling and Indexing, Google, Links, Search Engines
If a page has a NOINDEX tag on it, how much PageRank does it have? The intuitive answer would be “None”. How can a page that is not indexed have PageRank? Wouldn’t it be treated like a dangling link and disregarded during PageRank calculations?
Apparently not. Matt Cutts states on seomoz:
Does a link from a page with meta robots="noindex, follow" carry less weight? no weight?
For Google, I believe such links would carry the same weight as normal links on regular pages.
Hmmm. Does he mean that the unindexed page actually has a PageRank? Or does he mean that the zero Pagerank that the unindexed page has would be divided out among the links on the page, giving nothing to each? I wonder …
One thing’s for sure … if “NOINDEX, FOLLOW” works as implied, it’s a great way to inject spammy content and links.
