robots-exclusion-issues

=== 2005 ===

* ''The "efficacy" and "collateral damage" issues from [[rel-nofollow#open_issues|rel="nofollow"]] also apply.''

==== Precedence ====

* Should earlier values take precedence, or later?  Does <code>class="robots-nofollow robots-follow"</code> mean the same as <code>class="robots-nofollow"</code>, or the same as <code>class="robots-follow"</code>?
* The <code>meta</code> tag specification suggests not using conflicting or repeated directives, and so does not specify precedence.  <code>&lt;p class="robots-noindex robot1-index"&gt;</code> is an apparent conflict, but in this case the more specific directive should obviously override the general one at its point of applicability, no matter what order the directives appear in.
* Interaction with [[relnofollow]]: what does <code>class="robots-follow" rel="nofollow"</code> mean?  Currently [[relnofollow]] has no profile URI defined, so the Robot Exclusion Profile takes precedence.  In the future, per XMDP's [http://gmpg.org/xmdp/description#multiple Using Multiple Profiles], <q>the URIs in the 'profile' attribute are to be treated most significant (first) to least significant (last).</q>
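The precedence questions above can be made concrete in code.  The following is a hypothetical sketch, not part of the profile: it assumes that a robot-specific directive such as <code>robot1-index</code> always overrides the generic <code>robots-*</code> directive, and that at equal specificity the later value wins -- one possible answer to the first question, chosen here only for illustration.

```python
# Hypothetical sketch: resolve robots-exclusion class directives for one
# robot.  Assumes (NOT settled by the profile) that later values override
# earlier ones at equal specificity, and that robot-specific directives
# (e.g. "robot1-index") always override generic ones ("robots-noindex").

def resolve_index_directive(class_value, robot_id=None):
    """Return True (index), False (don't index), or None (no directive)."""
    generic = None
    specific = None
    for token in class_value.split():
        prefix, sep, action = token.partition("-")
        if sep == "" or action not in ("index", "noindex"):
            continue  # not an indexing directive (e.g. robots-nofollow)
        if prefix == "robots":
            generic = (action == "index")       # later value wins
        elif robot_id is not None and prefix == robot_id:
            specific = (action == "index")      # later value wins
    return specific if specific is not None else generic

# The conflicting example from the discussion above:
print(resolve_index_directive("robots-noindex robot1-index", "robot1"))  # True
print(resolve_index_directive("robots-noindex robot1-index", "robot2"))  # False
```

Under these assumptions the "apparent conflict" resolves the same way regardless of token order, because specificity, not position, decides.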

==== Phrases ====

Modern search engines normally support <i>phrase</i> queries.  A phrase query only matches documents that contain the words of the query, consecutively and in the same order.  This raises the question of whether a matched phrase should be allowed to straddle a <code>class="robots-noindex"</code> region.
Intuitively this should not be allowed.  The phrase query <code>"word1 word2"</code> should not match a document that contains <code>word1 &lt;b class="robots-noindex"&gt;ignore&lt;/b&gt; word2</code>.  This also gives webmasters an interesting tool: to specify that juxtaposed words should not be considered a phrase, just insert an empty unindexed region between them, as in <code>word1 &lt;b class="robots-noindex"&gt;&lt;/b&gt; word2</code>.
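One way an indexer might enforce this (a sketch; function and token names are illustrative, not from any search engine): emit an impossible "break" token at each excluded region, so that consecutive token positions never span the gap.

```python
# Hypothetical sketch of a tokenizer that prevents phrase matches from
# straddling a robots-noindex region: each excluded region is replaced by
# a "break" token that no real query can contain.

BREAK = "\x00"  # impossible token: real query words never contain NUL

def index_tokens(fragments):
    """fragments: list of (text, excluded) pairs in document order."""
    tokens = []
    for text, excluded in fragments:
        if excluded:
            tokens.append(BREAK)  # even an EMPTY excluded region breaks phrases
        else:
            tokens.extend(text.split())
    return tokens

def phrase_matches(tokens, phrase):
    """True if the phrase occurs consecutively and in order."""
    words = phrase.split()
    return any(tokens[i:i + len(words)] == words
               for i in range(len(tokens) - len(words) + 1))

# word1 <b class="robots-noindex">ignore</b> word2
tokens = index_tokens([("word1", False), ("ignore", True), ("word2", False)])
print(phrase_matches(tokens, "word1 word2"))  # False: the break token intervenes
print(phrase_matches(tokens, "word1"))        # True: single words still match
```

The empty-region trick from the paragraph above works for free here: an empty excluded fragment still emits the break token.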

==== Specificity ====

* Does not allow control of specific UAs à la [http://www.robotstxt.org/wc/norobots.html A Standard for Robot Exclusion]
If it is actually necessary to control specific UAs, here is a possible solution.

Example:

<pre><nowiki>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Page</title>
<link rel="schema.RobotExclusion" href="http://example.org/.../" />
<meta name="RobotExclusion.RobotName1" content="Foo Bot" />
<meta name="RobotExclusion.RobotName2" content="Bar Bot" />
<meta name="RobotExclusion.RobotName3" content="Evil Bot" />
</head>
<body>
<h1>Page</h1>
<p class="robots-noindex">This paragraph shouldn't be indexed by any bot.</p>
<p class="robot3-noindex">This paragraph should be indexed by every bot except "Evil Bot".</p>
<p class="robots-noindex robot1-index">This paragraph should only be indexed by "Foo Bot".</p>
</body>
</html>
</nowiki></pre>

Of course, it is a waste of bandwidth if there are "RobotExclusion.RobotName" meta tags on every page of a website.  Thus these meta tags should be stored on a single page - perhaps the main page - so they can be maintained easily:

<pre><nowiki>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Page</title>
<link rel="schema.RobotExclusion" href="http://example.org/.../" />
<link rel="RobotExclusion.Names" href="http://mypage.com/" />
</head>
<body>
<h1>Page</h1>
<p class="robots-noindex">This paragraph shouldn't be indexed by any bot.</p>
<p class="robot3-noindex">This paragraph should be indexed by every bot except "Evil Bot".</p>
<p class="robots-noindex robot1-index">This paragraph should only be indexed by "Foo Bot".</p>
</body>
</html>
</nowiki></pre>
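As an illustration of how a crawler might consume this scheme (a sketch under the assumptions of the example above; <code>RobotExclusion.RobotNameN</code> and the <code>robotN-</code> class prefixes are from this proposal, not a deployed standard), it could map its own name to a class prefix and then decide indexability per element:

```python
# Hypothetical sketch: map a crawler's name to the "robotN" class prefix
# declared by RobotExclusion.RobotNameN meta tags, then decide whether a
# given element may be indexed.  Names follow the example above only.

def robot_prefix(meta_tags, robot_name):
    """meta_tags: e.g. {"RobotExclusion.RobotName3": "Evil Bot"}."""
    for name, content in meta_tags.items():
        schema, _, field = name.partition(".")
        if (schema == "RobotExclusion" and field.startswith("RobotName")
                and content == robot_name):
            return "robot" + field[len("RobotName"):]
    return None  # crawler not named: only generic directives apply

def may_index(class_value, prefix):
    """True unless excluded; robot-specific directives override generic."""
    verdict = True
    specific = None
    for token in class_value.split():
        if token == "robots-noindex":
            verdict = False
        elif token == "robots-index":
            verdict = True
        elif prefix and token == prefix + "-noindex":
            specific = False
        elif prefix and token == prefix + "-index":
            specific = True
    return verdict if specific is None else specific

meta = {
    "RobotExclusion.RobotName1": "Foo Bot",
    "RobotExclusion.RobotName2": "Bar Bot",
    "RobotExclusion.RobotName3": "Evil Bot",
}
print(may_index("robot3-noindex", robot_prefix(meta, "Evil Bot")))  # False
print(may_index("robot3-noindex", robot_prefix(meta, "Foo Bot")))   # True
print(may_index("robots-noindex robot1-index", robot_prefix(meta, "Foo Bot")))  # True
```

The three paragraphs of the example produce exactly the behavior described in their text under this reading.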

==== Keywords ====

* The keywords <code>all</code> and <code>none</code> are defined by the Robots META Tag as convenience shortcuts to enable or disable the combination of <code>nofollow</code> and <code>noindex</code>, but predate Google's <code>noarchive</code> and should not be considered to include it.  As a result, for purposes of clarity and simplicity (the [http://gmpg.org/xmdp/description#principles XMDP Minimalism principle]), they are not included in this version of the Robot Exclusion Profile.
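The shortcut expansion described above amounts to a simple mapping.  The following sketch is illustrative only, since the profile deliberately omits these keywords; note that <code>all</code>/<code>none</code> expand to the index/follow pair only, leaving <code>noarchive</code> untouched.

```python
# Illustrative only: the Robots META Tag shortcut keywords described above.
# "all"/"none" predate noarchive, so they expand to the index/follow pair.
SHORTCUTS = {
    "all": {"index", "follow"},
    "none": {"noindex", "nofollow"},
}

def expand(directives):
    """Expand shortcut keywords; other directives pass through unchanged."""
    out = set()
    for d in directives:
        out |= SHORTCUTS.get(d, {d})
    return out

print(sorted(expand(["none"])))             # ['nofollow', 'noindex']
print(sorted(expand(["noarchive", "all"]))) # ['follow', 'index', 'noarchive']
```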

=== 2006 ===

==== Suitability as a microformat ====

* Isn't the Robot Exclusion Profile designed for machines first and humans second instead of vice versa?  Yes, just as much as [[relnofollow]], the deployed microformat that it's designed to replace.
* I'd like to echo this concern. We need to discuss whether or not this is a suitable microformat. --[[User:RyanKing|RyanKing]] 13:34, 17 Jan 2006 (PST)

==== Extension ====

* As I read this, I had the idea to use this microformat to differentiate the real content of a webpage from the rest (navigation, header, footer, ...) - you could do this by marking the "real content" with the tag "index", but that's not really clear.  Maybe you could create a new tag to mark the really important things on the page (the "real content") as distinct from the rest. --[[User:Habakuk|Habakuk]] 03:42, 14 Jan 2007 (PST)
* Another idea is to mark an area of a page as independent from the rest (e.g. for listings of software tools - if I search for software that can do ''a'' and ''b'', I don't want a result that offers me one tool that can do ''a'' and another that can do ''b''). --[[User:Habakuk|Habakuk]] 03:42, 14 Jan 2007 (PST)

Revision as of 01:18, 13 November 2007


These are externally raised issues about robots exclusion with broadly varying degrees of merit. Thus some issues are REJECTED for a number of obvious reasons (but still documented here in case they are re-raised), and others contain longer discussions. Some issues may be ACCEPTED and perhaps cause changes or improved explanations in the spec.

IMPORTANT: Please read the robots exclusion FAQ before giving any feedback or raising any issues, as your feedback/issues may already be resolved/answered.

Submitted issues may (and probably will) be edited and rewritten for better terseness, clarity, calmness, rationality, and as neutral a point of view as possible. Write your issues well. — Tantek

For matters relating to the meta robots specification itself, see meta-robots-errata and meta-robots-suggestions.

== closed issues ==

Resolved issues that have no further actions to take. These will likely be moved to a separate page like robots-exclusion-issues-closed.

== resolved issues ==

Issues that are resolved but may have outstanding to-do items. As issues are resolved, they will be moved from the top of the Issues list to the bottom of this section.

== template ==

Please use this format (copy and paste this to the end of the list to add your issues):

== related pages ==
