[uf-discuss] [citation] url field

Mon Dec 4 13:48:04 PST 2006

On 12/2/06, Mike Schinkel <mikeschinkel at gmail.com> wrote:
> A couple points on this subject. I have recently been doing a *lot* of
> research in the area of URLs/URIs and having discussions with numerous
> people on REST-discuss and www-TAG lists so I feel I'm pretty well-versed on
> this subject now.
>
> Although it is possible to infer an ISBN or maybe even a DOI from a URL, it
> is considered "Bad Practice" unless the "URI Authority" (i.e. owner of the
> website) specifically documented the structure of the URL and gave a
> reasonably trustworthy guarantee that it will not change.
>
> References:
>
> 1.) "Architecture of the World Wide Web, Volume One" section 2.5 on "URI
> Opacity" [1]:
>
>         Good practice: URI opacity
>         Agents making use of URIs SHOULD NOT attempt to infer properties of
>         the referenced resource.
>
> 2.) "The use of Metadata in URIs" section 2.1 on "Reliability of URI
> metadata" [2]
>
>         Constraint: Web software MUST NOT depend on the correctness of
>         metadata inferred from a URI, except when the encoding of such
>         metadata is documented by applicable standards and specifications.
>
> 3.) "The use of Metadata in URIs" section 2.1 on "Reliability of URI
> metadata" [2]
>
>         The principle conclusions of this finding are:
>
>         * Assignment authorities may publish specifications detailing the
>         structure and semantics of the URIs they assign. Other users of
>         those URIs may use such specifications to infer information about
>         resources identified by URI assigned by that authority.
>
>         * People and software using URIs assigned outside of their own
>         authority should make as few inferences as possible about a
>         resource based on its URI. The more dependencies a piece of
>         software has on particular constraints and inferences, the more
>         fragile it becomes to change and the lower its generic utility.
>
> In the case of Jon Udell's LibraryLookup, which has been referenced as an
> example:
>
>         Data point: ISBNs are already being reliably extracted from URLs:
>
> http://weblog.infoworld.com/udell/stories/2002/12/11/librarylookup.html
>
> Jon's work has been derided by purists as doing something it shouldn't i.e.
> "peeking" into URLs when they should remain opaque. Personally, I don't see
> what Jon did as such a bad thing. Jon's script interfaces with a human only,
> and if Amazon ever changes their URLs his script just won't work and the
> user will figure that out. In the meantime, by breaking the rules he's
> offering pretty useful functionality that he couldn't get otherwise. And
> even if Amazon does change their URLs and his script breaks, which is highly
> unlikely given their affiliate program, Jon can just update his script and
> then anyone who has a broken script can search for Jon's new version (unless
> Amazon eliminates the ISBN from the URL, which I would highly doubt).
>
> However, advocating the use of non-document metadata in a URL for a
> Microformat citation is a completely different matter. Jon Udell is one
> author using it for one app (LibraryLookup), whose users can later get
> updates if required. Advocating it for a Microformat, where authors will
> mark up untold HTML content, much of which will never get updated for
> future revisions, requires a very high bar for immutability. IOW, we should
> ensure that we have a *guarantee* that the format of the URL will *never*
> change, or we shouldn't use it. Yes, we *could* still parse the old format,
> but we'd have to continue adding parsers, some of which might eventually
> fail for ambiguity.
>
> At the moment, the only immutable reference for an ISBN is a URN from RFC
> 3187[4]. For example:
>
>         URN:ISBN:0-395-36341-1
>
> This doesn't dereference in a browser (in IE7, for example), but one day
> it might. But we can be sure it is definitely immutable.
>
> As for resolving DOIs, they are new to me and I've not done enough research
> to determine if there is an immutable resolvable source for DOIs.  This
> article[5] and these websites ([6] & [7]) might be helpful there.
>
> As an aside, please don't take this as me being unsupportive.  On the
> contrary, I am a strong advocate to get website owners to put metadata in
> their URLs and to document that metadata. However, until we have solid
> sources of URLs with documented metadata, we should probably all play
> smartly by the rules as specified by the W3C, at least IMO.
>
> -Mike Schinkel
> http://www.mikeschinkel.com/blogs/
> http://www.welldesignedurls.org/
>
> [1] http://www.w3.org/TR/webarch/#uri-opacity
> [2] http://www.w3.org/2001/tag/doc/metaDataInURI-31-20061107.html
> [3] http://www.w3.org/2001/tag/doc/metaDataInURI-31-20061107.html#N1023D
> [4] http://www.ietf.org/rfc/rfc3187.txt
> [5] http://www.dlib.org/dlib/june98/06powell.html
> [6] http://www.handle.net/
> [7] http://www.doi.org/
>

Mike, thanks for all the detail. I definitely learned some things.

In the context of my original proposal to add a "URL" field to the
microformat, I now feel like I need to separate that proposal from one
of the statements I made in it:

"I also suggest that in the case of identifiers like a DOI or ISBN
which can be represented as a parameter in a link to doi.org or some
other resolver, that the format encourage using a URL field for those
identifiers and not include separate fields for each such identifier.
In other words, I think that class="url uid" is sufficient to encode
DOI/ISBN/etc., and we shouldn't add a separate DOI class, a separate
ISBN class, and so on."

To be clear - I still think that *if* it is possible to mark up a DOI
or ISBN as a link without obscuring the DOI, then that's a positive
thing. It sounds like it's just more complicated than I thought to do
that. So maybe the format doesn't need to mention those in connection
with the URL field.
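
As a rough illustration of why an unobscured identifier is attractive: the
URN form quoted above (URN:ISBN:0-395-36341-1) carries the ISBN explicitly,
so recovering it needs no knowledge of any site's URL layout. The Python
sketch below is only that, a sketch. The function name isbn10_from_urn is
hypothetical, and the accepted syntax is a simplification that covers only
10-digit ISBNs with optional hyphens, not everything RFC 3187 allows.

```python
import re

# Hypothetical helper, not from RFC 3187 or any library. Assumption: only
# 10-digit ISBNs with optional hyphens are handled.
_URN_ISBN = re.compile(r"^urn:isbn:([0-9Xx-]+)$", re.IGNORECASE)


def isbn10_from_urn(urn):
    """Return the bare 10-character ISBN from a URN, or None if invalid."""
    match = _URN_ISBN.match(urn.strip())
    if match is None:
        return None
    isbn = match.group(1).replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return None
    # ISBN-10 check: weights 10 down to 1; "X" stands for 10 in the last place.
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(isbn))
    return isbn if total % 11 == 0 else None


print(isbn10_from_urn("URN:ISBN:0-395-36341-1"))  # prints 0395363411
```

A string that is not in URN form, such as an ordinary http URL that happens
to contain the same digits, yields None here. That is the point: nothing is
inferred from a URL whose structure nobody has documented.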

I do think that a URL field (class="url") should be included, to
represent a link to a copy of the cited work, and if we want to mark
up one or more identifiers, we can use a separate class (I suggest
"uid") to do so. If we're lucky and there's a good way to merge them,
then use class="url uid".
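
To make the two cases concrete, here is a minimal Python sketch of how a
consumer might read them. Only the class names "url" and "uid" come from
the proposal. The sample markup, the "citation" wrapper, the example.org
address, and the names CitationFields and extract_fields are all made up
for illustration, and treating the href as the uid in the merged case is
an assumption, not settled behavior.

```python
from html.parser import HTMLParser

# Hypothetical markup: a link to a copy of the work plus a separate identifier.
SAMPLE = """
<span class="citation">
  <a class="url" href="http://example.org/papers/copy.pdf">A cited work</a>
  <span class="uid">URN:ISBN:0-395-36341-1</span>
</span>
"""


class CitationFields(HTMLParser):
    """Collect values of elements classed "url" and/or "uid"."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.uids = []
        self._uid_tag = None   # tag whose text is being collected as a uid
        self._uid_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        href = attrs.get("href")
        if "url" in classes and href:
            self.urls.append(href)
        if "uid" in classes:
            if href:
                # Merged case (class="url uid"): assume the href is the uid.
                self.uids.append(href)
            else:
                self._uid_tag = tag
                self._uid_text = []

    def handle_data(self, data):
        if self._uid_tag is not None:
            self._uid_text.append(data)

    def handle_endtag(self, tag):
        # Sketch only: does not handle a same-named tag nested inside the uid.
        if tag == self._uid_tag:
            self.uids.append("".join(self._uid_text).strip())
            self._uid_tag = None


def extract_fields(markup):
    """Return (urls, uids) found in the given markup."""
    parser = CitationFields()
    parser.feed(markup)
    parser.close()
    return parser.urls, parser.uids


print(extract_fields(SAMPLE))
```

With the sample above, the link and the identifier come back as separate
values. A single element carrying class="url uid" would put the same href
in both lists.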

I'd like to get feedback on whether or not the list likes the idea of
a URL field as outlined above - separate from the issue of URNs and
metadata recovery.

The use case I'm focused on is here:
http://microformats.org/wiki/citation-brainstorming#Acquiring_reference_information_from_the_web

Thanks,
-mike

-- 
Michael McCracken
UCSD CSE PhD Candidate
research: http://www.cse.ucsd.edu/~mmccrack/
misc: http://michael-mccracken.net/wp/