Revision as of 04:26, 24 June 2007

Robot Exclusion Profile

Draft Specification 2005-06-18

Authors

Peter Janes

Copyright

This specification is © 2004-2005 by the author. However, the author intends to submit this specification to a standards body with a liberal copyright/licensing policy such as the GMPG. See the GMPG Principles for more details. Anyone wishing to contribute to this effort MUST read those principles, especially those regarding copyright and licensing, and agree to them before contributing.

Patents

The author neither holds nor intends to apply for any patents on anything required to implement this specification.

Abstract

The Robot Exclusion Profile is a reworking of the Robots META tag (and less-standard extensions) as a microformat.

Introduction

The Robots META tag is used to provide page-specific direction for web crawlers. While being useful in many cases, its page-specific nature means it cannot be used to restrict crawlers from indexing only certain sections of a document. Several attempts have been made to create more granular solutions through various methods but have perceived shortcomings that limit their use; the Robot Exclusion Profile defines a microformat that can be applied to any element or set of elements in a page.

Like other microformats such as hCalendar, the Robot Exclusion Profile defines a set of class names that may be applied to (X)HTML elements. class can be applied to almost every (X)HTML element, which means that authors may be as specific or general as they wish in their application. This differs from the similarly-purposed rel="nofollow" attribute, which may only be applied to (and does not refer to the content of) a specific inline link. (It is interesting to note that this behaviour is entirely encompassed by the use of class="robots-nofollow" on the same element.) Classes are also additive, so multiple values can be specified at once, e.g. class="robots-nofollow robots-noindex". For robot exclusion in particular, this allows authors to specify multiple rules for an element without adding unnecessary extra markup.
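Because the class attribute is a whitespace-separated list, a crawler can pick out the profile's directives without disturbing any other classes on the element. The following is a minimal sketch of that idea; the function name `robot_directives` is hypothetical, and exact, case-sensitive token matching is an assumption that this draft does not itself state.

```python
from typing import List

# The eight class names defined by the draft's XMDP profile.
ROBOT_CLASSES = {
    "robots-nofollow", "robots-follow",
    "robots-noindex", "robots-index",
    "robots-noanchortext", "robots-anchortext",
    "robots-noarchive", "robots-archive",
}


def robot_directives(class_attr: str) -> List[str]:
    """Return the robots-* tokens in a class attribute, in source order.

    Hypothetical helper: unrelated class names are ignored.
    """
    return [token for token in class_attr.split() if token in ROBOT_CLASSES]


if __name__ == "__main__":
    print(robot_directives("sidebar robots-nofollow robots-noindex"))
```

Returning the tokens in source order leaves any precedence rule to the caller, since the draft leaves precedence as an open issue.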

Format

Profile URI

http://example.org/xmdp/robots-profile# (obviously preliminary)

The classes defined by the Robot Exclusion Profile should be considered meaningless when the profile URI is not present in the document <head>'s profile attribute.
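A crawler could test for the profile before honouring any of the classes. The sketch below assumes the URI is listed in the profile attribute of the document's head element, as the previous revision of this sentence states, and that the attribute holds a whitespace-separated list of URIs. The names `ProfileFinder` and `profile_declared` are hypothetical; the parsing uses Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Preliminary profile URI quoted from the draft.
PROFILE_URI = "http://example.org/xmdp/robots-profile#"


class ProfileFinder(HTMLParser):
    """Collects the URIs listed in the head element's profile attribute."""

    def __init__(self):
        super().__init__()
        self.profiles = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            # A missing or valueless attribute yields an empty list.
            value = dict(attrs).get("profile") or ""
            self.profiles.extend(value.split())


def profile_declared(html: str) -> bool:
    """Return True when the document declares the Robot Exclusion Profile."""
    finder = ProfileFinder()
    finder.feed(html)
    finder.close()
    return PROFILE_URI in finder.profiles
```

A crawler following this approach would ignore every robots-* class on pages for which `profile_declared` returns False.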

@@ Line 24: / Line 24: @@
 <code><nowiki>http://example.org/xmdp/robots-profile#</nowiki></code> (obviously preliminary)
-The classes defined by the Robot Exclusion Profile should be considered meaningless when the profile URI is not present in the document <code>&lt;head&gt;</code>'s <code>profile</code> attribute.
+The classes defined by the Robot Exclusion Profile should be considered meaningless when the profile URI is not present in the document <code>
-=== XMDP Profile ===
-<pre><nowiki><dl class="profile">
- <dt id="robots-nofollow">robots-nofollow</dt>
- <dd>
-  Informs robots that links contained by the element are not to be followed.
- </dd>
- <dt id="robots-follow">robots-follow</dt>
- <dd>
-  Informs robots that links contained by the element are to be followed.
- </dd>
- <dt id="robots-noindex">robots-noindex</dt>
- <dd>
-  Informs robots that the content of the element is not to be included as part of the page.
- </dd>
- <dt id="robots-index">robots-index</dt>
- <dd>
-  Informs robots that the content of the element is to be included as part of the page.
- </dd>
- <dt id="robots-noanchortext">robots-noanchortext</dt>
- <dd>
-  Informs robots that the link target document is not to be indexed under the anchor text.
- </dd>
- <dt id="robots-anchortext">robots-anchortext</dt>
- <dd>
-  Informs robots that the link target document is to be indexed under the anchor text.
- </dd>
- <dt id="robots-noarchive">robots-noarchive</dt>
- <dd>
-  Informs caching robots that the content of the element is not to be included in their cached copy.
- </dd>
- <dt id="robots-archive">robots-archive</dt>
- <dd>
-  Informs caching robots that the content of the element is to be included in their cached copy.
- </dd>
-</dl></nowiki></pre>
-== Examples ==
-Removing page content:
-<pre><nowiki>
-<head profile="http://example.org/xmdp/robots-profile#">
-...
-<div class="robots-noindex">There once was a man from Nantucket…</div>
-<p>This page is not about <span class="robots-noindex">pornography</span>.</p>
-</nowiki></pre>
-Showing <code>nofollow</code> in conjunction with [[votelinks]], and applying it in parallel with [[relnofollow]]:
-<pre><nowiki>
-<head profile="http://example.org/xmdp/robots-profile#">
-...
-<p class="robots-nofollow">This is <a href="http://example.com/bogus">a bogus link</a>
-and so is <a href="http://example.net/bogus">this</a>.</p>
-<p>I don't like <a rel="nofollow" rev="vote-against" class="robots-nofollow"
-                   href="http://example.com/disagree">this page</a>
-but I do like <a rev="vote-for" href="http://example.com/agree">this one</a>.</p>
-</nowiki></pre>
-Preventing images from being stored by search engines, forcing them to be retrieved from the originating website:
-<pre><nowiki>
-<head profile="http://example.org/xmdp/robots-profile#">
-...
-<p><img src="example.png" class="robots-noarchive" alt="Private image" /></p>
-</nowiki></pre>
-A consequence of this is that the short summaries that modern search engines display alongside result links also exclude <code>robots-noarchive</code> content.  We suggest replacing small excluded segments with an ellipsis [<code>...</code>].  Unarchived segments of a size comparable to the segments the search engine normally uses for summaries can simply be omitted.  A display of an entire cached document that has unarchived segments should probably also indicate where text has been elided, whatever the size of the segment.
-A [http://peterjanes.ca/2005/robots/example more complex example] is available which also shows how the robots metadata may be [http://tantek.com/log/2005/06.html#d03t2359 visualized].
-== References ==
-=== Normative ===
-* [http://gmpg.org/xmdp/ XMDP]
-* [http://www.robotstxt.org/wc/meta-user.html The Robots META Tag]
-=== Informative ===
-* [http://www.robotstxt.org/wc/norobots.html A Standard for Robot Exclusion]
-* [http://www.google.com/bot.html#noindextags Googlebot Frequently Asked Questions]
-* [http://www.bauser.com/websnob/meta/robots.html The ROBOTS META Tag]
-* [[relnofollow|RelNoFollow Draft Specification]]
-* This page was contributed from the [http://developers.technorati.com/wiki/RobotsExclusion technorati developers' wiki].
-=== Thanks ===
-* [http://tantek.com/log/ Tantek Çelik]
-* [http://www.lachy.id.au/ Lachlan Hunt]
-* [http://www.joesapt.net/ Joe D'Andrea]
-== Issues ==
-These are open issues that have been raised in various forums.  The "efficacy" and "collateral damage" issues from [[relnofollow#open_issues|rel="nofollow"]] also apply.
-=== Precedence ===
-* Should earlier or later values take precedence?  Does <code>class="robots-nofollow robots-follow"</code> mean the same as <code>class="robots-nofollow"</code> or <code>class="robots-follow"</code>?
-* The Robots <code>meta</code> tag suggests not using conflicting or repeated directives and so does not specify precedence.  <code>&lt;p class="robots-noindex robot1-index"&gt;</code> is an apparent conflict, but in this case the more specific directive should override the general one at its point of applicability, regardless of the order in which the directives appear.
-* Interaction with [[relnofollow]]: what does <code>class="robots-follow" rel="nofollow"</code> mean?  Currently [[relnofollow]] has no profile URI defined, so the Robot Exclusion Profile takes precedence.  In the future, per XMDP's [http://gmpg.org/xmdp/description#multiple Using Multiple Profiles], <q>the URIs in the 'profile' attribute are to be treated most significant (first) to least significant (last).</q>
-=== Phrases ===
-Modern search engines normally support <i>phrase</i> queries.  A phrase query only matches documents that contain the words of the query consecutively and in the same order.  This raises the question of whether a matched phrase should be allowed to straddle a <code>class="robots-noindex"</code> region.
-Intuitively this should not be allowed.  The phrase query <code>"word1 word2"</code> should not match a document that contains <code>word1 &lt;b class="robots-noindex"&gt;ignore&lt;/b&gt; word2</code>.  This also gives webmasters an interesting tool: to specify that juxtaposed words should not be treated as a phrase, insert an empty unindexed region, as in <code>word1 &lt;b class="robots-noindex"&gt;&lt;/b&gt; word2</code>.
-=== Specificity ===
-* Does not allow control of specific UAs à la [http://www.robotstxt.org/wc/norobots.html A Standard for Robot Exclusion]
-If it is actually necessary to control specific UAs, here is a possible solution.
-Example:
-<pre><nowiki>
-<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
-<html>
-<head>
-<link rel="schema.RobotExclusion" href="http://example.org/.../" />
-<meta name="RobotExclusion.RobotName1" content="Foo Bot" />
-<meta name="RobotExclusion.RobotName2" content="Bar Bot" />
-<meta name="RobotExclusion.RobotName3" content="Evil Bot" />
-</head>
-<body>
-<h1>Page</h1>
-<p class="robots-noindex">This paragraph shouldn't be indexed by any bot.</p>
-<p class="robot3-noindex">This paragraph should be indexed by every bot except "Evil Bot".</p>
-<p class="robots-noindex robot1-index">This paragraph should only be indexed by "Foo Bot".</p>
-</body>
-</html>
-</nowiki></pre>
-Of course it is a waste of bandwidth if there are "RobotExclusion.RobotName" meta tags
-on every page of a website. These meta tags should therefore be stored on one page, perhaps the
-main page, so they can be maintained easily.
-<pre><nowiki>
-<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
-<html>
-<head>
-<link rel="schema.RobotExclusion" href="http://example.org/.../" />
-<link rel="RobotExclusion.Names" href="http://mypage.com/" />
-</head>
-<body>
-<h1>Page</h1>
-<p class="robots-noindex">This paragraph shouldn't be indexed by any bot.</p>
-<p class="robot3-noindex">This paragraph should be indexed by every bot except "Evil Bot".</p>
-<p class="robots-noindex robot1-index">This paragraph should only be indexed by "Foo Bot".</p>
-</body>
-</html>
-</nowiki></pre>
-=== Keywords ===
-* The keywords <code>all</code> and <code>none</code> are defined by the Robots META Tag as convenience shortcuts to enable or disable the combination of <code>nofollow</code> and <code>noindex</code>, but predate Google's <code>noarchive</code> and should not be considered to include it.  As a result, for purposes of clarity and simplicity (the [http://gmpg.org/xmdp/description#principles XMDP Minimalism principle]), they are not included in this version of the Robot Exclusion Profile.
-=== Suitability as a microformat ===
-* Isn't the Robot Exclusion Profile designed for machines first and humans second instead of vice versa?  Yes, just as much as [[relnofollow]], the deployed microformat that it's designed to replace.
-* I'd like to echo this concern. We need to discuss whether or not this is a suitable microformat. --[[User:RyanKing|RyanKing]] 13:34, 17 Jan 2006 (PST)
-=== Extension ===
-* As I read this, I had the idea to use this microformat to differentiate the real content of a webpage from the rest (navigation, header, footer, ...). You could do this by marking the "real content" with the tag "index", but that's not really clear. Maybe you could create a new tag to mark the really important things on the page (the "real content") as distinct from the rest. --[[User:Habakuk|Habakuk]] 03:42, 14 Jan 2007 (PST)
-* Another idea is to mark an area of a page as independent from the rest (e.g. for listings of software tools: if I search for software that can do ''a'' and ''b'', I don't want to get a result that offers me one program that can do ''a'' and another that can do ''b''). --[[User:Habakuk|Habakuk]] 03:42, 14 Jan 2007 (PST)
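The phrase-boundary behaviour proposed under Phrases above could be prototyped as follows. This is only a sketch under stated assumptions: the names `IndexableText`, `indexable_segments` and `phrase_matches` are hypothetical, the markup is assumed to be well formed, a nested `robots-index` class is not treated as re-enabling indexing (the draft does not settle this), and the profile URI check is omitted for brevity.

```python
from html.parser import HTMLParser

# Void elements have no end tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "param", "source", "track", "wbr"}


class IndexableText(HTMLParser):
    """Collects indexable words, starting a new segment at each excluded region."""

    def __init__(self):
        super().__init__()
        self.excluded = []    # one flag per open element; True inside robots-noindex
        self.segments = [[]]  # runs of consecutive indexable words

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        inside = bool(self.excluded) and self.excluded[-1]
        now = inside or "robots-noindex" in classes
        if now and not inside and self.segments[-1]:
            self.segments.append([])  # phrase boundary where the region starts
        self.excluded.append(now)

    def handle_endtag(self, tag):
        if tag not in VOID and self.excluded:
            self.excluded.pop()

    def handle_data(self, data):
        if not (self.excluded and self.excluded[-1]):
            self.segments[-1].extend(data.split())


def indexable_segments(html):
    """Return the non-empty runs of indexable words in document order."""
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    return [segment for segment in parser.segments if segment]


def phrase_matches(html, phrase):
    """True if the phrase's words occur consecutively within a single segment."""
    words = phrase.split()
    if not words:
        return False
    n = len(words)
    return any(segment[i:i + n] == words
               for segment in indexable_segments(html)
               for i in range(len(segment) - n + 1))
```

Because matching never crosses a segment boundary, an empty `robots-noindex` element is enough to stop two adjacent words from being treated as a phrase, which is the webmaster tool described in that issue.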

robots-exclusion: Difference between revisions
