[uf-discuss] [citation] url field

Sat Dec 2 18:32:32 PST 2006

A couple points on this subject. I have recently been doing a *lot* of
research in the area of URLs/URIs and having discussions with numerous
people on REST-discuss and www-TAG lists so I feel I'm pretty well-versed on
this subject now.

Although it is possible to infer an ISBN or maybe even a DOI from a URL, it
is considered "Bad Practice" unless the "URI Authority" (i.e. owner of the
website) has specifically documented the structure of the URL and given a
reasonably trustworthy guarantee that it will not change.

References:

1.) "Architecture of the World Wide Web, Volume One" section 2.5 on "URI
Opacity" [1]:

	Good practice: URI opacity
	Agents making use of URIs SHOULD NOT attempt to infer properties of
	the referenced resource.

2.) "The use of Metadata in URIs" section 2.1 on "Reliability of URI
metadata" [2]

	Constraint: Web software MUST NOT depend on the correctness of
	metadata inferred from a URI, except when the encoding of such
	metadata is documented by applicable standards and specifications.

3.) "The use of Metadata in URIs" section 2.1 on "Reliability of URI
metadata" [2]

	The principal conclusions of this finding are:

	* Assignment authorities may publish specifications detailing the
	structure and semantics of the URIs they assign. Other users of
	those URIs may use such specifications to infer information about
	resources identified by URIs assigned by that authority.

	* People and software using URIs assigned outside of their own
	authority should make as few inferences as possible about a
	resource based on its URI. The more dependencies a piece of
	software has on particular constraints and inferences, the more
	fragile it becomes to change and the lower its generic utility.

In the case of Jon Udell's LibraryLookup, which has been referenced as an
example:

	Data point: ISBNs are already being reliably extracted from URLs:

http://weblog.infoworld.com/udell/stories/2002/12/11/librarylookup.html

Jon's work has been derided by purists as doing something it shouldn't, i.e.
"peeking" into URLs when they should remain opaque. Personally, I don't see
what Jon did as such a bad thing. Jon's script interfaces with a human only,
and if Amazon ever changes their URLs his script just won't work and the
user will figure that out. In the meantime, by breaking the rules he's
offering pretty useful functionality that he couldn't get otherwise. And
even if Amazon does change their URLs and his script breaks, which is highly
unlikely given their affiliate program, Jon can just update his script and
then anyone who has a broken script can search for Jon's new version (unless
Amazon eliminates the ISBN from the URL, which I highly doubt).
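
To make the fragility concrete, here is a minimal Python sketch of this kind
of URL "peeking". It is not Jon's code; the function name isbn_from_url, the
regular expression, and the example.com URLs are all assumptions made for
illustration, and the path layout is a guess at an undocumented convention
rather than anything a site has promised.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumption: the ISBN-10 appears as its own path segment, i.e. nine digits
# followed by a digit or "X", bounded by "/" or the end of the path.
_ISBN10_SEGMENT = re.compile(r"/(\d{9}[\dXx])(?=/|$)")


def isbn_from_url(url: str) -> Optional[str]:
    """Guess an ISBN-10 from a URL path, or return None if nothing matches."""
    match = _ISBN10_SEGMENT.search(urlparse(url).path)
    return match.group(1).upper() if match else None


if __name__ == "__main__":
    # Hypothetical URLs: the first follows the guessed layout, the second
    # shows what happens after a redesign moves the ISBN into the query.
    print(isbn_from_url("http://www.example.com/exec/obidos/ASIN/0395363411/ref=xyz"))
    print(isbn_from_url("http://www.example.com/item?id=0395363411"))
```

When the layout changes, the function simply stops finding anything. A human
running a script notices that; a page marked up years ago does not.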

However, advocating the use of non-document metadata in a URL for a
Microformat citation is a completely different matter. Jon's case is one
author (Jon Udell) using it for one app (LibraryLookup) whose users can
later get updates if required. A Microformat, by contrast, is something
authors will use to mark up untold amounts of HTML content, much of which
will never be updated for future revisions, so advocating it there requires
a very high bar for immutability. IOW, we should ensure that we have a
*guarantee* that the format of the URL will *never* change, or we shouldn't
use it. Yes, we *could* still parse the old format, but we'd have to keep
adding parsers, some of which might eventually fail because of ambiguity.
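
The maintenance burden can be sketched as follows. This is a hypothetical
Python illustration, not a proposal for the microformat: both URL layouts,
the example.com addresses, and all names are invented. The point is only
that every historical layout needs its own pattern, kept forever, and that
patterns can come to overlap.

```python
import re
from typing import List

# Hypothetical URL layouts, one pattern per "generation" of a site's URLs.
# Old patterns can never be retired, because old pages keep the old URLs.
PATTERNS = {
    "layout-a": re.compile(r"/ASIN/(\d{9}[\dX])(?:/|$)"),
    "layout-b": re.compile(r"/dp/(\d{9}[\dX])(?:/|$)"),
}


def candidate_isbns(url: str) -> List[str]:
    """Return every distinct ISBN-like value that any known layout finds."""
    found: List[str] = []
    for pattern in PATTERNS.values():
        for value in pattern.findall(url):
            if value not in found:
                found.append(value)
    return found


def isbn_or_fail(url: str) -> str:
    """Return the single candidate, or raise if there are none or several."""
    candidates = candidate_isbns(url)
    if len(candidates) != 1:
        raise ValueError(f"expected one ISBN, found {candidates!r} in {url}")
    return candidates[0]
```

A URL that happens to satisfy two layouts with different values leaves the
consumer with no principled way to choose, which is the ambiguity failure
described above.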

At the moment, the only immutable reference for an ISBN is a URN from RFC
3187[4]. For example:

	URN:ISBN:0-395-36341-1

This doesn't dereference in a browser (in IE7, for example), but one day it
might. Either way, we can be sure it is immutable.
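
For illustration, here is a minimal Python sketch of reading such a URN and
checking the ISBN-10 check digit. The function names are hypothetical, and
the parsing is a simplification that only handles the ten-digit form shown
above; it is not a complete implementation of RFC 3187.

```python
from typing import Optional


def isbn10_is_valid(isbn: str) -> bool:
    """Check the ISBN-10 check digit (weighted sum must be divisible by 11)."""
    chars = isbn.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # "X" stands for 10 and is only allowed as the check digit
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value  # weights run from 10 down to 1
    return total % 11 == 0


def isbn_from_urn(urn: str) -> Optional[str]:
    """Return the ISBN from a 'URN:ISBN:...' string, or None if it is not one."""
    parts = urn.strip().split(":", 2)
    if len(parts) != 3:
        return None
    scheme, namespace, isbn = parts
    # The "urn" prefix and the namespace identifier are case-insensitive.
    if scheme.lower() != "urn" or namespace.lower() != "isbn":
        return None
    return isbn if isbn10_is_valid(isbn) else None


if __name__ == "__main__":
    print(isbn_from_urn("URN:ISBN:0-395-36341-1"))
```

Unlike a guess based on a URL path, nothing here depends on how any
particular website chooses to lay out its addresses.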

As for resolving DOIs, they are new to me and I've not done enough research
to determine if there is an immutable resolvable source for DOIs.  This
article[5] and these websites ([6] & [7]) might be helpful there.

As an aside, please don't take this as me being unsupportive.  On the
contrary, I am a strong advocate of getting website owners to put metadata
in their URLs and to document that metadata. However, until we have solid
sources of URLs with documented metadata, we should probably all play
smartly by the rules as specified by the W3C, at least IMO.

-Mike Schinkel
http://www.mikeschinkel.com/blogs/
http://www.welldesignedurls.org/

[1] http://www.w3.org/TR/webarch/#uri-opacity
[2] http://www.w3.org/2001/tag/doc/metaDataInURI-31-20061107.html
[3] http://www.w3.org/2001/tag/doc/metaDataInURI-31-20061107.html#N1023D
[4] http://www.ietf.org/rfc/rfc3187.txt
[5] http://www.dlib.org/dlib/june98/06powell.html
[6] http://www.handle.net/
[7] http://www.doi.org/