| <code><nowiki>http://example.org/xmdp/robots-profile#</nowiki></code> (obviously preliminary)
|
| |
|
| The classes defined by the Robot Exclusion Profile should be considered meaningless when the profile URI is not present in the document <code><head></code>'s <code>profile</code> attribute.
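A robot could check for the profile URI before honouring any of the classes. The following is a minimal sketch using Python's standard <code>html.parser</code>; the names <code>ProfileFinder</code> and <code>profile_applies</code> are hypothetical, and the URI is the preliminary one given above.

```python
from html.parser import HTMLParser

# Preliminary profile URI taken from this page.
PROFILE_URI = "http://example.org/xmdp/robots-profile#"


class ProfileFinder(HTMLParser):
    """Collects the URIs listed in the profile attribute of <head>."""

    def __init__(self):
        super().__init__()
        self.profiles = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            value = dict(attrs).get("profile") or ""
            # The profile attribute holds a whitespace-separated list of URIs.
            self.profiles.extend(value.split())


def profile_applies(html):
    """Return True when the Robot Exclusion Profile URI is declared."""
    finder = ProfileFinder()
    finder.feed(html)
    finder.close()
    return PROFILE_URI in finder.profiles
```

When <code>profile_applies</code> returns false, a robot following this draft would treat the <code>robots-*</code> classes as ordinary class names.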
| | |
| === XMDP Profile ===
| |
| <pre><nowiki><dl class="profile">
| |
| <dt id="robots-nofollow">robots-nofollow</dt>
| |
| <dd>
| |
| Informs robots that links contained by the element are not to be followed.
| |
| </dd>
| |
| <dt id="robots-follow">robots-follow</dt>
| |
| <dd>
| |
| Informs robots that links contained by the element are to be followed.
| |
| </dd>
| |
| <dt id="robots-noindex">robots-noindex</dt>
| |
| <dd>
| |
| Informs robots that the content of the element is not to be included as part of the page.
| |
| </dd>
| |
| <dt id="robots-index">robots-index</dt>
| |
| <dd>
| |
| Informs robots that the content of the element is to be included as part of the page.
| |
| </dd>
| |
| <dt id="robots-noanchortext">robots-noanchortext</dt>
| |
| <dd>
| |
| Informs robots that the link target document is not to be indexed under the anchor text.
| |
| </dd>
| |
| <dt id="robots-anchortext">robots-anchortext</dt>
| |
| <dd>
| |
| Informs robots that the link target document is to be indexed under the anchor text.
| |
| </dd>
| |
| <dt id="robots-noarchive">robots-noarchive</dt>
| |
| <dd>
| |
| Informs caching robots that the content of the element is not to be included in their cached copy.
| |
| </dd>
| |
| <dt id="robots-archive">robots-archive</dt>
| |
| <dd>
| |
| Informs caching robots that the content of the element is to be included in their cached copy.
| |
| </dd>
| |
| </dl></nowiki></pre>
| |
| | |
| == Examples ==
| |
| Removing page content:
| |
| <pre><nowiki>
| |
| <head profile="http://example.org/xmdp/robots-profile#">
| |
| ...
| |
| <div class="robots-noindex">There once was a man from Nantucket…</div>
| |
| <p>This page is not about <span class="robots-noindex">pornography</span>.</p>
| |
| </nowiki></pre>
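How an indexer might drop <code>robots-noindex</code> content can be sketched as follows. This is an illustration only, not part of the draft: it assumes well-formed markup (every non-void element is closed), assumes that a nested <code>robots-index</code> re-enables indexing inside an excluded region, and skips the profile check. The names <code>IndexableText</code> and <code>indexable_text</code> are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have an end tag; they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class IndexableText(HTMLParser):
    """Collects text that is not inside a robots-noindex element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one flag per open element: True when excluded
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        excluded = self.stack[-1] if self.stack else False  # inherit from parent
        if "robots-noindex" in classes:
            excluded = True
        elif "robots-index" in classes:
            excluded = False  # assumption: an inner robots-index re-enables indexing
        self.stack.append(excluded)

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not (self.stack and self.stack[-1]):
            self.chunks.append(data)


def indexable_text(html):
    """Return the whitespace-normalised text a robot would index."""
    parser = IndexableText()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())
```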
| |
| | |
| Showing <code>nofollow</code> in conjunction with [[votelinks]], and applying it in parallel with [[relnofollow]]:
| |
| | |
| <pre><nowiki>
| |
| <head profile="http://example.org/xmdp/robots-profile#">
| |
| ...
| |
| <p class="robots-nofollow">This is <a href="http://example.com/bogus">a bogus link</a>
| |
| and so is <a href="http://example.net/bogus">this</a>.</p>
| |
| | |
| <p>I don't like <a rel="nofollow" rev="vote-against" class="robots-nofollow"
| |
| href="http://example.com/disagree">this page</a>
| |
| but I do like <a rev="vote-for" href="http://example.com/agree">this one</a>.</p>
| |
| </nowiki></pre>
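A crawler's link selection for markup like the example above might look like the following sketch. It assumes well-formed markup and, since the interaction with <code>rel="nofollow"</code> is still an open issue, it simply treats either mechanism as sufficient to skip a link. The names <code>FollowableLinks</code> and <code>followable_links</code> are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have an end tag; they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class FollowableLinks(HTMLParser):
    """Collects href values that a robot may follow."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one flag per open element: True when links are blocked
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        blocked = self.stack[-1] if self.stack else False  # inherit from parent
        if "robots-nofollow" in classes:
            blocked = True
        elif "robots-follow" in classes:
            blocked = False
        self.stack.append(blocked)

        rel = (attrs.get("rel") or "").lower().split()
        # Assumption: rel="nofollow" always blocks, whatever the class says.
        if tag == "a" and attrs.get("href") and not blocked and "nofollow" not in rel:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()


def followable_links(html):
    parser = FollowableLinks()
    parser.feed(html)
    parser.close()
    return parser.links
```

Applied to the two paragraphs in the example, only the <code>vote-for</code> link would be followed.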
| |
| | |
| Preventing images from being stored by search engines, forcing them to be retrieved from the originating website:
| |
| | |
| <pre><nowiki>
| |
| <head profile="http://example.org/xmdp/robots-profile#">
| |
| ...
| |
| <p><img src="example.png" class="robots-noarchive" alt="Private image" /></p>
| |
| </nowiki></pre>
| |
| | |
| A consequence of this is that the short summaries that modern search engines display with result links also exclude <code>robots-noarchive</code> content. We suggest replacing small excluded segments with an ellipsis [<code>...</code>]. Unarchived segments comparable in size to the segments the search engine normally uses for summaries can simply be omitted. A display of an entire cached document that has unarchived segments should probably also mark the places where text has been elided, whatever their size.
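The summary rule suggested here can be sketched in a few lines. The input format (a list of <code>(text, archivable)</code> pairs), the function name <code>build_summary</code>, and the <code>omit_from</code> threshold of 40 characters are all assumptions made for illustration; the draft does not define what counts as a "small" segment.

```python
def build_summary(segments, omit_from=40):
    """Join text segments into a summary line.

    segments:  list of (text, archivable) pairs in document order.
    omit_from: hypothetical size (in characters) from which an unarchived
               segment is dropped silently instead of being marked.
    """
    parts = []
    for text, archivable in segments:
        if archivable:
            parts.append(text)
        elif len(text) < omit_from:
            parts.append("[...]")  # small excluded segment: show an ellipsis
        # Larger unarchived segments are simply omitted.
    return " ".join(" ".join(parts).split())
```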
| |
| | |
| A [http://peterjanes.ca/2005/robots/example more complex example] is available which also shows how the robots metadata may be [http://tantek.com/log/2005/06.html#d03t2359 visualized].
| |
| | |
| == References ==
| |
| === Normative ===
| |
| * [http://gmpg.org/xmdp/ XMDP]
| |
| * [http://www.robotstxt.org/wc/meta-user.html The Robots META Tag]
| |
| | |
| === Informative ===
| |
| * [http://www.robotstxt.org/wc/norobots.html A Standard for Robot Exclusion]
| |
| * [http://www.google.com/bot.html#noindextags Googlebot Frequently Asked Questions]
| |
| * [http://www.bauser.com/websnob/meta/robots.html The ROBOTS META Tag]
| |
| * [[relnofollow|RelNoFollow Draft Specification]]
| |
| * This page was contributed from the [http://developers.technorati.com/wiki/RobotsExclusion technorati developers' wiki].
| |
| | |
| === Thanks ===
| |
| * [http://tantek.com/log/ Tantek Çelik]
| |
| * [http://www.lachy.id.au/ Lachlan Hunt]
| |
| * [http://www.joesapt.net/ Joe D'Andrea]
| |
| | |
| == Issues ==
| |
| These are open issues that have been raised in various forums. The "efficacy" and "collateral damage" issues from [[relnofollow#open_issues|rel="nofollow"]] also apply.
| |
| | |
| === Precedence ===
| |
* Should earlier values take precedence or later? Does <code>class="robots-nofollow robots-follow"</code> mean the same as <code>class="robots-nofollow"</code> or <code>class="robots-follow"</code>?
| |
* The Robots <code>meta</code> tag advises against conflicting or repeated directives and so does not specify precedence. <code><p class="robots-noindex robot1-index"></code> is an apparent conflict, but in this case the more specific directive should override the general one at its point of applicability, regardless of the order in which the directives appear.
| |
| * Interaction with [[relnofollow]]: what does <code>class="robots-follow" rel="nofollow"</code> mean? Currently [[relnofollow]] has no profile URI defined, so the Robot Exclusion Profile takes precedence. In the future, per XMDP's [http://gmpg.org/xmdp/description#multiple Using Multiple Profiles], <q>the URIs in the 'profile' attribute are to be treated most significant (first) to least significant (last).</q>
| |
| | |
| === Phrases ===
| |
| | |
| Modern search engines normally support <i>phrase</i> queries. A phrase query only matches documents that contain the words of the query consecutively and in the same order. That raises the question of whether a matched phrase should be allowed to straddle a <code>class="robots-noindex"</code> region.
| |
| | |
| Intuitively this should not be allowed. The phrase query <code>"word1 word2"</code> should not match a document that contains <code>word1 <b class="robots-noindex">ignore</b> word2</code>. This also gives webmasters a tool for specifying that juxtaposed words should not be treated as a phrase: insert an empty unindexed region, as in <code>word1 <b class="robots-noindex"></b> word2</code>.
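One way an indexer could implement this is to emit a break marker into the token stream wherever a <code>robots-noindex</code> element starts, so that no phrase can match across it, even when the region is empty. The sketch below assumes well-formed markup and plain whitespace tokenisation; <code>PhraseTokens</code>, <code>tokens_of</code> and <code>matches_phrase</code> are hypothetical names.

```python
from html.parser import HTMLParser

BREAK = None  # marker that never equals a word, so phrases cannot span it
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class PhraseTokens(HTMLParser):
    """Tokenises indexable text, inserting BREAK at each noindex region."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one flag per open element: True when excluded
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        excluded = self.stack[-1] if self.stack else False
        if "robots-noindex" in classes:
            excluded = True
            self.tokens.append(BREAK)  # emitted even if the region is empty
        self.stack.append(excluded)

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not (self.stack and self.stack[-1]):
            self.tokens.extend(data.lower().split())


def tokens_of(html):
    parser = PhraseTokens()
    parser.feed(html)
    parser.close()  # flushes any text buffered after the last tag
    return parser.tokens


def matches_phrase(html, phrase):
    """True when the words of phrase occur consecutively with no BREAK between."""
    tokens = tokens_of(html)
    words = phrase.lower().split()
    n = len(words)
    return n > 0 and any(tokens[i:i + n] == words
                         for i in range(len(tokens) - n + 1))
```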
| |
| | |
| === Specificity ===
| |
| * Does not allow control of specific UAs à la [http://www.robotstxt.org/wc/norobots.html A Standard for Robot Exclusion]
| |
| | |
If it is actually necessary to control specific UAs, here is a possible solution.
| |
| Example:
| |
| | |
| <pre><nowiki>
| |
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
| |
| <html>
| |
| <head>
| |
| <link rel="schema.RobotExclusion" href="http://example.org/.../" />
| |
| <meta name="RobotExclusion.RobotName1" content="Foo Bot" />
| |
| <meta name="RobotExclusion.RobotName2" content="Bar Bot" />
| |
| <meta name="RobotExclusion.RobotName3" content="Evil Bot" />
| |
| </head>
| |
| <body>
| |
| <h1>Page</h1>
| |
| <p class="robots-noindex">This paragraph shouldn't be indexed by any bot.</p>
| |
| <p class="robot3-noindex">This paragraph should be indexed by every bot except "Evil Bot".</p>
| |
| <p class="robots-noindex robot1-index">This paragraph should only be indexed by "Foo Bot".</p>
| |
| </body>
| |
| </html>
| |
| </nowiki></pre>
| |
Of course it is a waste of bandwidth if there are "RobotExclusion.RobotName" meta tags on every page of a website. These meta tags should therefore be stored on one page, perhaps the main page, so they can be maintained easily.
| |
| | |
| <pre><nowiki>
| |
| <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
| |
| <html>
| |
| <head>
| |
| <link rel="schema.RobotExclusion" href="http://example.org/.../" />
| |
| <link rel="RobotExclusion.Names" href="http://mypage.com/" />
| |
| </head>
| |
| <body>
| |
| <h1>Page</h1>
| |
| <p class="robots-noindex">This paragraph shouldn't be indexed by any bot.</p>
| |
| <p class="robot3-noindex">This paragraph should be indexed by every bot except "Evil Bot".</p>
| |
| <p class="robots-noindex robot1-index">This paragraph should only be indexed by "Foo Bot".</p>
| |
| </body>
| |
| </html>
| |
| </nowiki></pre>
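The three paragraphs in these examples can be resolved per robot with a small function. This sketch follows the suggestion under Precedence that a robot-specific directive overrides the general one regardless of order. The function name <code>should_index</code>, passing the robot as the number from its <code>RobotExclusion.RobotNameN</code> entry, and letting <code>noindex</code> win when a class list contains both values at the same level are all assumptions.

```python
def should_index(class_attr, robot_number, inherited=True):
    """Decide whether the robot numbered robot_number may index an element.

    class_attr:   the element's class attribute value.
    robot_number: N from the robot's RobotExclusion.RobotNameN meta entry.
    inherited:    the decision that applies to the parent element.
    """
    classes = class_attr.split()
    specific = f"robot{robot_number}"

    # Robot-specific directives are checked first, whatever their position.
    if f"{specific}-noindex" in classes:
        return False
    if f"{specific}-index" in classes:
        return True

    # Then the general directives (assumption: noindex wins a direct conflict).
    if "robots-noindex" in classes:
        return False
    if "robots-index" in classes:
        return True

    return inherited
```

With "Foo Bot" as robot 1 and "Evil Bot" as robot 3, <code>should_index("robots-noindex robot1-index", 1)</code> is true while the same call for robot 3 is false.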
| |
| | |
| === Keywords ===
| |
| * The keywords <code>all</code> and <code>none</code> are defined by the Robots META Tag as convenience shortcuts to enable or disable the combination of <code>nofollow</code> and <code>noindex</code>, but predate Google's <code>noarchive</code> and should not be considered to include it. As a result, for purposes of clarity and simplicity (the [http://gmpg.org/xmdp/description#principles XMDP Minimalism principle]), they are not included in this version of the Robot Exclusion Profile.
| |
| | |
| === Suitability as a microformat ===
| |
| * Isn't the Robot Exclusion Profile designed for machines first and humans second instead of vice versa? Yes, just as much as [[relnofollow]], the deployed microformat that it's designed to replace.
| |
| * I'd like to echo this concern. We need to discuss whether or not this is a suitable microformat. --[[User:RyanKing|RyanKing]] 13:34, 17 Jan 2006 (PST)
| |
| | |
| === Extension ===
| |
* As I read this, I had the idea of using this microformat to differentiate the real content of a webpage from the rest (navigation, header, footer, ...). You could do this by marking the "real content" with the tag "index", but that's not really clear. Maybe you could create a new tag to distinguish the really important things on the page (the "real content") from the rest. --[[User:Habakuk|Habakuk]] 03:42, 14 Jan 2007 (PST)
| |
* Another idea is to mark an area of a page as independent from the rest (e.g. for listings of software tools: if I search for software that can do ''a'' and ''b'', I don't want to get a result that offers me one program that can do ''a'' and another that can do ''b''). --[[User:Habakuk|Habakuk]] 03:42, 14 Jan 2007 (PST)
| |