robots-exclusion: Difference between revisions

From Microformats Wiki
{{DISPLAYTITLE:Robot Exclusion Profile}}
{{DraftSpecification}}
__TOC__
== Draft Specification 2005-06-18 ==


=== Copyright ===
Per the public domain release on the author's and contributors' user pages ([[User:PeterJ|Peter Janes]], [[User:RyanKing|Ryan King]], [[User:Tantek|Tantek Çelik]]) this specification is released into the public domain.
 
Public Domain Contribution Requirement. Since the author(s) released this work into the public domain, in order to maintain this work's public domain status, all contributors to this page agree to release their contributions to this page to the public domain as well. Contributors may indicate their agreement by adding the public domain release template (http://microformats.org/wiki/Template:public-domain-release) to their user page per the Voluntary Public Domain Declarations instructions (http://microformats.org/wiki/Category:public_domain_license). Unreleased contributions may be reverted/removed.


=== Patents ===


== Abstract ==
The Robot Exclusion Profile is a reworking of the [[Robots META]] tag (and less-standard extensions) as a [[microformat]].


== Introduction ==
The [[Robots META]] tag provides page-specific direction for web crawlers.  Although useful in many cases, its page-specific nature means it cannot restrict crawlers from indexing only certain sections of a document.  Several attempts have been made to create more granular solutions, but each has perceived shortcomings that limit its use; the Robot Exclusion Profile defines a microformat that can be applied to any element or set of elements in a page.


Like other microformats such as [[hcalendar|hCalendar]], the Robot Exclusion Profile defines a set of class names that may be applied to (X)HTML elements.  <code>class</code> can be applied to almost every (X)HTML element, which means that authors may be as specific or general as they wish in their application.  This differs from the similarly-purposed <code>rel="nofollow"</code> attribute, which may only be applied to (and does not refer to the content of) a specific inline link.  (It is interesting to note that this behavior is entirely encompassed by the use of <code>class="robots-nofollow"</code> on the same element.)  Classes are also additive, so multiple values can be specified at once, e.g. <code>class="robots-nofollow robots-noindex"</code>.  For robot exclusion in particular, this allows authors to specify multiple rules for an element without adding unnecessary extra markup.


== Format ==
=== Profile URI ===
<code><nowiki>http://example.org/xmdp/robots-profile#</nowiki></code> (obviously a placeholder)
 
The classes defined by the Robot Exclusion Profile should be considered meaningless when the profile URI is not present in the document <code>&lt;head&gt;</code>'s <code>profile</code> attribute.
 
=== XMDP Profile ===
<pre><nowiki><dl class="profile">
<dt id="robots-nofollow">robots-nofollow</dt>
<dd>
  Informs robots that links contained by the element are not to be followed.
</dd>
<dt id="robots-follow">robots-follow</dt>
<dd>
  Informs robots that links contained by the element are to be followed.
</dd>
<dt id="robots-noindex">robots-noindex</dt>
<dd>
  Informs robots that the content of the element is not to be included as part of the page.
</dd>
<dt id="robots-index">robots-index</dt>
<dd>
  Informs robots that the content of the element is to be included as part of the page.
</dd>
<dt id="robots-noanchortext">robots-noanchortext</dt>
<dd>
  Informs robots that the link target document is not to be indexed under the anchor text.
</dd>
<dt id="robots-anchortext">robots-anchortext</dt>
<dd>
  Informs robots that the link target document is to be indexed under the anchor text.
</dd>
<dt id="robots-noarchive">robots-noarchive</dt>
<dd>
  Informs caching robots that the content of the element is not to be included in their cached copy.
</dd>
<dt id="robots-archive">robots-archive</dt>
<dd>
  Informs caching robots that the content of the element is to be included in their cached copy.
</dd>
</dl></nowiki></pre>
 
== Examples ==
Removing page content:
<pre><nowiki>
<head profile="http://example.org/xmdp/robots-profile#">
...
<div class="robots-noindex">There once was a man from Nantucket…</div>
<p>This page is not about <span class="robots-noindex">pornography</span>.</p>
</nowiki></pre>
 
Showing <code>nofollow</code> in conjunction with [[votelinks]], and applying it in parallel with [[rel-nofollow]]:
 
<pre><nowiki>
<head profile="http://example.org/xmdp/robots-profile#">
...
<p class="robots-nofollow">This is <a href="http://example.com/bogus">a bogus link</a>
and so is <a href="http://example.net/bogus">this</a>.</p>
 
<p>I don't like <a rel="nofollow" rev="vote-against" class="robots-nofollow"
                  href="http://example.com/disagree">this page</a>
but I do like <a rev="vote-for" href="http://example.com/agree">this one</a>.</p>
</nowiki></pre>
 
Preventing images from being stored by search engines, forcing them to be retrieved from the originating website:
 
<pre><nowiki>
<head profile="http://example.org/xmdp/robots-profile#">
...
<p><img src="example.png" class="robots-noarchive" alt="Private image" /></p>
</nowiki></pre>
 
A consequence of this is that the short summaries that modern search engines display with result links also exclude <code>robots-noarchive</code> content.  We suggest replacing small excluded segments with an ellipsis [<code>...</code>].  Unarchived segments comparable in size to the segments the search engine normally uses for summaries can simply be omitted.  A display of an entire cached document that has unarchived segments should probably also mark the places where text has been elided, whatever the size of the segment.
 
A [http://peterjanes.ca/2005/robots/example more complex example] is available which also shows how the robots metadata may be [http://tantek.com/log/2005/06.html#d03t2359 visualized].
 
== References ==
=== Normative ===
* [http://gmpg.org/xmdp/ XMDP]
* [http://www.robotstxt.org/wc/meta-user.html The Robots META Tag]
 
=== Informative ===
* [http://www.robotstxt.org/wc/norobots.html A Standard for Robot Exclusion]
* [http://www.google.com/bot.html#noindextags Googlebot Frequently Asked Questions]
* [http://www.bauser.com/websnob/meta/robots.html The ROBOTS META Tag]
* [[relnofollow|RelNoFollow Draft Specification]]
* This page was contributed from the [http://developers.technorati.com/wiki/RobotsExclusion technorati developers' wiki].
 
=== Thanks ===
* [http://tantek.com/log/ Tantek Çelik]
* [http://www.lachy.id.au/ Lachlan Hunt]
* [http://www.joesapt.net/ Joe D'Andrea]
 
== related pages ==
* <span id="Issues"> [[robots-exclusion-issues]]</span>
* [[robots-exclusion-brainstorming]]


[[Category:Draft Specifications]]
[[Category:robots-exclusion]]

Latest revision as of 16:32, 18 July 2020

This document represents a draft microformat specification. Although drafts are somewhat mature in the development process, the stability of this document cannot be guaranteed, and implementers should be prepared to keep abreast of future developments and changes. Watch this wiki page, or follow discussions on the #microformats IRC channel to stay up-to-date.

Draft Specification 2005-06-18

Authors

Copyright

Per the public domain release on the author's and contributors' user pages (Peter Janes, Ryan King, Tantek Çelik) this specification is released into the public domain.

Public Domain Contribution Requirement. Since the author(s) released this work into the public domain, in order to maintain this work's public domain status, all contributors to this page agree to release their contributions to this page to the public domain as well. Contributors may indicate their agreement by adding the public domain release template (http://microformats.org/wiki/Template:public-domain-release) to their user page per the Voluntary Public Domain Declarations instructions (http://microformats.org/wiki/Category:public_domain_license). Unreleased contributions may be reverted/removed.

Patents

The author neither holds nor intends to apply for any patents on anything required to implement this specification.

Abstract

The Robot Exclusion Profile is a reworking of the Robots META tag (and less-standard extensions) as a microformat.

Introduction

The Robots META tag provides page-specific direction for web crawlers. Although useful in many cases, its page-specific nature means it cannot restrict crawlers from indexing only certain sections of a document. Several attempts have been made to create more granular solutions, but each has perceived shortcomings that limit its use; the Robot Exclusion Profile defines a microformat that can be applied to any element or set of elements in a page.

Like other microformats such as hCalendar, the Robot Exclusion Profile defines a set of class names that may be applied to (X)HTML elements. class can be applied to almost every (X)HTML element, which means that authors may be as specific or general as they wish in their application. This differs from the similarly-purposed rel="nofollow" attribute, which may only be applied to (and does not refer to the content of) a specific inline link. (It is interesting to note that this behavior is entirely encompassed by the use of class="robots-nofollow" on the same element.) Classes are also additive, so multiple values can be specified at once, e.g. class="robots-nofollow robots-noindex". For robot exclusion in particular, this allows authors to specify multiple rules for an element without adding unnecessary extra markup.

Format

Profile URI

http://example.org/xmdp/robots-profile# (obviously a placeholder)

The classes defined by the Robot Exclusion Profile should be considered meaningless when the profile URI is not present in the document <head>'s profile attribute.
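
A consumer could apply this rule before honouring any of the classes. The following is a minimal sketch, assuming Python's standard html.parser module and the placeholder profile URI above; the names HeadProfileFinder and uses_robots_profile are hypothetical and not part of the draft. It treats the profile attribute as a whitespace-separated list of URIs, as HTML 4 permits.

```python
from html.parser import HTMLParser

# Placeholder URI taken from the draft; a real deployment would use the final one.
PROFILE_URI = "http://example.org/xmdp/robots-profile#"


class HeadProfileFinder(HTMLParser):
    """Collects the URIs listed in the profile attribute of <head>."""

    def __init__(self):
        super().__init__()
        self.profiles = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            # The attribute may be absent or valueless, hence the "or".
            value = dict(attrs).get("profile") or ""
            # HTML 4 allows several profile URIs separated by white space.
            self.profiles.extend(value.split())


def uses_robots_profile(html):
    """Return True if the document declares the Robot Exclusion Profile."""
    finder = HeadProfileFinder()
    finder.feed(html)
    finder.close()
    return PROFILE_URI in finder.profiles
```

A crawler following this sketch would ignore the robots-* classes entirely whenever uses_robots_profile returns False.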

XMDP Profile

<dl class="profile">
 <dt id="robots-nofollow">robots-nofollow</dt>
 <dd>
  Informs robots that links contained by the element are not to be followed.
 </dd>
 <dt id="robots-follow">robots-follow</dt>
 <dd>
  Informs robots that links contained by the element are to be followed.
 </dd>
 <dt id="robots-noindex">robots-noindex</dt>
 <dd>
  Informs robots that the content of the element is not to be included as part of the page.
 </dd>
 <dt id="robots-index">robots-index</dt>
 <dd>
  Informs robots that the content of the element is to be included as part of the page.
 </dd>
 <dt id="robots-noanchortext">robots-noanchortext</dt>
 <dd>
  Informs robots that the link target document is not to be indexed under the anchor text.
 </dd>
 <dt id="robots-anchortext">robots-anchortext</dt>
 <dd>
  Informs robots that the link target document is to be indexed under the anchor text.
 </dd>
 <dt id="robots-noarchive">robots-noarchive</dt>
 <dd>
  Informs caching robots that the content of the element is not to be included in their cached copy.
 </dd>
 <dt id="robots-archive">robots-archive</dt>
 <dd>
  Informs caching robots that the content of the element is to be included in their cached copy.
 </dd>
</dl>
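
Because class is a whitespace-separated token list, a consumer can pick out the profile's names from whatever other classes an element carries. A minimal sketch in Python, assuming exact, case-sensitive matching of the eight names defined above; the function name robots_directives is hypothetical.

```python
# The eight class names defined by the XMDP profile above.
ROBOTS_CLASSES = frozenset({
    "robots-nofollow", "robots-follow",
    "robots-noindex", "robots-index",
    "robots-noanchortext", "robots-anchortext",
    "robots-noarchive", "robots-archive",
})


def robots_directives(class_attr):
    """Return the profile class names present in a class attribute value."""
    # Unrelated classes (other microformats, styling hooks) are ignored.
    return {token for token in class_attr.split() if token in ROBOTS_CLASSES}
```

For instance, robots_directives("robots-nofollow robots-noindex") yields both rules for the element, which is the additive behaviour described in the introduction.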

Examples

Removing page content:

<head profile="http://example.org/xmdp/robots-profile#">
...
<div class="robots-noindex">There once was a man from Nantucket…</div>
<p>This page is not about <span class="robots-noindex">pornography</span>.</p>
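
An indexer might consume markup like this roughly as follows. This is a sketch only, using Python's standard html.parser: the draft does not say how nested robots-index and robots-noindex elements interact, so the code assumes the nearest enclosing element wins, and it assumes reasonably well-formed markup. The names IndexableText and indexable_text are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have an end tag and therefore never enclose content.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class IndexableText(HTMLParser):
    """Collects text that is not inside a robots-noindex element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, indexable) for each open element
        self.chunks = []

    def _indexable(self):
        # Content is indexable unless an enclosing element says otherwise.
        return self.stack[-1][1] if self.stack else True

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        indexable = self._indexable()
        # Assumption: the nearest element carrying a rule overrides its ancestors.
        if "robots-noindex" in classes:
            indexable = False
        elif "robots-index" in classes:
            indexable = True
        if tag not in VOID_ELEMENTS:
            self.stack.append((tag, indexable))

    def handle_endtag(self, tag):
        # Pop back to the matching open element; stray end tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self._indexable():
            self.chunks.append(data)


def indexable_text(html):
    """Return the whitespace-normalized text a robot may index."""
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```

Applied to the second line of the example, only the text outside the span would be indexed.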

Showing nofollow in conjunction with votelinks, and applying it in parallel with rel-nofollow:

<head profile="http://example.org/xmdp/robots-profile#">
...
<p class="robots-nofollow">This is <a href="http://example.com/bogus">a bogus link</a>
and so is <a href="http://example.net/bogus">this</a>.</p>

<p>I don't like <a rel="nofollow" rev="vote-against" class="robots-nofollow"
                   href="http://example.com/disagree">this page</a>
but I do like <a rev="vote-for" href="http://example.com/agree">this one</a>.</p>
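
A link-following robot could treat the markup above along these lines. Again this is only a sketch under stated assumptions: it uses Python's standard html.parser, links are followed by default, a rule on an element applies to that element and everything it contains, and the nearest rule wins when robots-follow and robots-nofollow are nested (the draft does not define this). It deliberately ignores rel="nofollow", which is a separate mechanism. The names FollowableLinks and followable_links are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have an end tag and therefore never enclose links.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class FollowableLinks(HTMLParser):
    """Collects href values of links a robot is permitted to follow."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, follow) for each open element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Inherit from the enclosing element; follow by default.
        follow = self.stack[-1][1] if self.stack else True
        if "robots-nofollow" in classes:
            follow = False
        elif "robots-follow" in classes:
            follow = True
        if tag == "a" and follow and attrs.get("href"):
            self.links.append(attrs["href"])
        if tag not in VOID_ELEMENTS:
            self.stack.append((tag, follow))

    def handle_endtag(self, tag):
        # Pop back to the matching open element; stray end tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def followable_links(html):
    """Return the list of link targets that may be followed."""
    parser = FollowableLinks()
    parser.feed(html)
    parser.close()
    return parser.links
```

Under these assumptions, the only link collected from the example is the vote-for link to http://example.com/agree.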

Preventing images from being stored by search engines, forcing them to be retrieved from the originating website:

<head profile="http://example.org/xmdp/robots-profile#">
...
<p><img src="example.png" class="robots-noarchive" alt="Private image" /></p>

A consequence of this is that the short summaries that modern search engines display with result links also exclude robots-noarchive content. We suggest replacing small excluded segments with an ellipsis [...]. Unarchived segments comparable in size to the segments the search engine normally uses for summaries can simply be omitted. A display of an entire cached document that has unarchived segments should probably also mark the places where text has been elided, whatever the size of the segment.
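
The summary suggestion above can be sketched as follows. This is an illustration rather than a prescribed algorithm: it assumes the page has already been split into (text, archivable) segments in document order, and the snippet_len threshold that separates "small" from "comparable to a summary segment" is a hypothetical parameter. The function name build_summary is also hypothetical.

```python
def build_summary(segments, snippet_len=40):
    """Join archivable text, marking small excluded segments with an ellipsis.

    segments: list of (text, archivable) pairs in document order.
    snippet_len: assumed size of a typical summary segment, in characters.
    """
    parts = []
    for text, archivable in segments:
        if archivable:
            parts.append(text.strip())
        elif len(text) < snippet_len:
            # Small excluded segment: show that something was elided.
            parts.append("...")
        # Excluded segments as long as a snippet (or longer) are simply omitted.
    return " ".join(part for part in parts if part)
```

A full cached-copy display would differ in that every excluded segment, whatever its size, would get a visible marker.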

A more complex example is available which also shows how the robots metadata may be visualized.
