[uf-discuss] format for identifiers?

C. Hudley chudley at gmail.com
Tue Nov 29 09:30:50 PST 2005

On 11/28/05, Simon Kittle <si at kittle.co.uk> wrote:
> So it's not that he was after a way to identify these things (for which
> there exists a URI scheme already) but a way to put them into context, like
> grouping a bunch of tags under class='vcard' and marking one class='url'
> puts it into context.

Right, that's what I'm after.  I'm all for URIs.  I'm all for the info
protocol, and already use it internally.  I *don't* want a new syntax
for identifiers.

What I'm looking for is a standard way to indicate which identifiers
being *presented to humans* on a web page are relevant identifiers for
the items on that page.

I disagree that "an identifier isn't useful if you don't know what it
is".  There are many identifier systems (DOI, UUID, etc.) in
widespread use that are generic in presentation.  The benefit of those
systems is that the resolvable meaning of the identifiers can be
equivalent across systems.  I know there's no need to explain that
here, but it seems important to clarify (given the discussion) that in
the context of wiring arbitrary systems together web2.0-style, when
you are potentially moving objects around across the boundaries of
individual webapps, it can be very useful to base connections on
identifiers that are arbitrarily human-meaningful yet unambiguously
resolvable, with equivalent meaning across contexts.

Since I'm new and obviously struggling to make my case I'll just go to
our use cases (where it seems I should have started, sorry).  It'll
take several paragraphs, bear with me, can't help it. :\

In libraries we use the OpenURL (ANSI/NISO Z39.88) specs to pass
content references by-reference (using identifiers) or by-value (using
defined attribute-value sets per media type) across systems.  It's
been mentioned on-list recently so I won't go into great detail, but
here's a summary.  The immediate scenario this spec was designed for
is this:  you're a scholar doing research online.  You read a
relevant article in Journal A from Publisher1 and need to chase down
its references.  Of 25 references, 22 are in different journals
from different publishers, so you *want* to click right to those
articles.  But since university libraries subscribe to various content
packages from various publishers and those subscriptions vary widely
among institutions, you need a way to (a) pass the reference to your
library, (b) have the library figure out which subscription/interface
has the content, and (c) get to the article online in another system,
or request through interlibrary loan, or save the reference in your
folder, etc.  In this scenario the ability to pass identifiers across
systems is a huge win because otherwise you can only match on
field/value pairs you get from the original source, which vary widely
(author names, titles, vol/iss/page/yr/etc.).  The OpenURL spec
details how to pass that information in a URL, with definitions for
how to specify ContextObjects, aka "the things [usually references]
you want to do something with", in GET strings or POSTed XML.
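
To make the by-reference vs. by-value distinction concrete, here is a
minimal Python sketch of the GET-string form.  The resolver address,
the DOI, and the article details are made up for illustration, and the
key names (url_ver, rft_id, rft_val_fmt, rft.atitle, ...) are the
key/encoded-value names as I understand them from Z39.88-2004, so
check them against the spec before relying on this.

```python
from urllib.parse import urlencode

# Hypothetical resolver address; every institution runs its own.
RESOLVER_BASE = "http://resolver.example.edu/openurl"


def openurl_by_reference(identifier):
    """Describe the referent with nothing but an identifier."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("rft_id", identifier),
    ]
    return RESOLVER_BASE + "?" + urlencode(params)


def openurl_by_value(atitle, jtitle, volume, spage, date):
    """Describe the referent with attribute/value pairs for a journal article."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.atitle", atitle),
        ("rft.jtitle", jtitle),
        ("rft.volume", volume),
        ("rft.spage", spage),
        ("rft.date", date),
    ]
    return RESOLVER_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    # Made-up identifier and citation, for illustration only.
    print(openurl_by_reference("info:doi/10.1000/example"))
    print(openurl_by_value("An Example Article", "Journal A", "12", "34", "2005"))
```

The by-reference link is short and unambiguous but only works if the
resolver knows the identifier; the by-value link always carries
something to match on, at the cost of fuzzy matching on names and
titles.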

So the problem with OpenURL *implementations* is that every publisher
who supports OpenURL-style linking publishes their OpenURL links using
inconsistent HTML *human*-readable formats.  You'd think OpenURL
should provide great leverage for rewiring apps -- it certainly could
-- but the upshot is that if you can't identify which bits on a page
comprise the OpenURL you can't write software that rewrites it
usefully.  To deal with this problem we've written an ad-hoc spec
called "COinS", i.e. "ContextObjects in Spans".  This spec says: "put
your ContextObjects in HTML span elements with a class value of
'Z3988'", and nothing more.  It's an anti-microformat, in a way, since
ContextObjects themselves are almost never human-readable.  In any
case, COinS have been implemented in numerous systems:  CiteULike,
Citebase, some online journals, unalog, weblogs including wordpress
and pyblosxom.
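
For anyone who hasn't seen one, here is a minimal Python sketch of
writing a COinS span and finding it again.  It assumes the
ContextObject is carried in the span's title attribute as an encoded
key/value string; the class value 'Z3988' is the only part stated
above, and the helper names and the sample identifier are made up.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlencode, parse_qs


def make_coins(pairs):
    """Return an empty span whose title carries the encoded ContextObject."""
    query = urlencode(pairs)
    # escape() turns the "&" separators into "&amp;" so the HTML stays valid.
    return '<span class="Z3988" title="%s"></span>' % escape(query, quote=True)


class CoinsFinder(HTMLParser):
    """Collect the ContextObject of every span with class 'Z3988'."""

    def __init__(self):
        super().__init__()
        self.context_objects = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "Z3988" in classes:
            # The parser has already unescaped "&amp;" back to "&".
            self.context_objects.append(parse_qs(attrs.get("title") or ""))


if __name__ == "__main__":
    span = make_coins([
        ("ctx_ver", "Z39.88-2004"),
        ("rft_id", "info:doi/10.1000/example"),  # made-up identifier
    ])
    finder = CoinsFinder()
    finder.feed("<p>Some citation text " + span + "</p>")
    print(span)
    print(finder.context_objects)
```

Note the span is empty: nothing in it is meant for the human reader,
which is the anti-microformat part.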

The first main benefit of COinS is that if you're a small publisher or
a weblogger, you can just put COinS in your pages, and people at
institutions that can resolve OpenURLs (most major universities and
many large libraries and corporations) can follow them to actual
articles.  Like, talk
about a research article in your blog, and your readers can link to
the articles at their libraries.  To support this we've generated ~900
institution-specific COinS-resolving bookmarklets and greasemonkey
userscripts based on data from an OCLC international registry of
OpenURL resolvers on a trial basis and it works nicely, for a demo.

But to the point:  lately some of us have experimented with doing more
with COinS.  Since OpenURL specifies where to put object identifiers,
and many of our systems have OAI-PMH interfaces that let you get
metadata for a given identifier through a simple GET, why not metadata
autodiscovery?  It's easy to wire up identifiers, as specified in
COinS, to relevant OAI-PMH services, with their URLs in link tags, so
you can script access to robust metadata for objects on a page right
from within the browser.  To restate, more simply:  for any content we
publish on any site, if you specify identifiers for items on a page
and how to query for more information about those items using the
identifiers, you can pull metadata for those items with a simple AJAX
call.  This could be huge leverage in wiring new systems together.
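
The wiring described above can be sketched in a few lines of Python.
This is only an illustration under assumptions: the rel value
"oai-pmh" on the link tag is a placeholder I made up, the sample page
and identifier are invented, and the COinS are assumed to carry their
ContextObject in the title attribute.  The GetRecord verb and the
identifier and metadataPrefix arguments are standard OAI-PMH.  The
sketch only builds the request URLs; fetching them is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, parse_qs

LINK_REL = "oai-pmh"  # hypothetical rel value, not from any spec


class CoinsPmhScanner(HTMLParser):
    """Collect OAI-PMH base URLs from link tags and identifiers from COinS."""

    def __init__(self):
        super().__init__()
        self.base_urls = []
        self.identifiers = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link":
            if (attrs.get("rel") or "").lower() == LINK_REL and attrs.get("href"):
                self.base_urls.append(attrs["href"])
        elif tag == "span" and "Z3988" in (attrs.get("class") or "").split():
            context_object = parse_qs(attrs.get("title") or "")
            self.identifiers.extend(context_object.get("rft_id", []))


def get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord request for one identifier."""
    query = urlencode([
        ("verb", "GetRecord"),
        ("identifier", identifier),
        ("metadataPrefix", metadata_prefix),
    ])
    return base_url + "?" + query


def metadata_links(html_text):
    """Pair every identifier on the page with every advertised service."""
    scanner = CoinsPmhScanner()
    scanner.feed(html_text)
    return [get_record_url(base, ident)
            for base in scanner.base_urls
            for ident in scanner.identifiers]


if __name__ == "__main__":
    page = (
        '<html><head><link rel="oai-pmh" href="http://example.org/oai">'
        '</head><body><span class="Z3988" '
        'title="ctx_ver=Z39.88-2004&amp;rft_id=oai%3Aexample.org%3A123">'
        '</span></body></html>'
    )
    for url in metadata_links(page):
        print(url)
```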

To demonstrate all this, I wrote a greasemonkey script that looks for
COinS-with-identifiers and link tags for OAI-PMH services.  When it
finds both (a combination we started calling "COinS-PMH",
because we suck at naming things), it pops up a left-side list of
links to all the metadata records for the items on the page.  This
doesn't do much besides giving a visual indicator of what's possible. 
But, we're working on extending various personal-collection-system
thingies we work on to support this stuff and do things more
magically.  Basically, imagine if a tool (like, say, flock) could suck
down not just a page link and its content or an image with a link but
specific hand-picked objects from pages, replete with complete,
complex metadata records for those objects.  And not just from a few
big name sites with robust, distinct APIs, but from any site, using
one simple API.

With more greasemonkey userscripts we've tweaked Amazon, Flickr,
Google Books, American Memory collections at LoC, arxiv.org,
wordpress, and unalog to speak COinS-PMH and thus give up metadata for
identified objects.  Screenshots here:


The Amazon, Flickr, and Google Books tweaks use a faux-OAI-PMH proxy
that responds to a few OAI-PMH queries with metadata retrieved using
the Amazon and Flickr APIs.

To sum:

 - there are *lots* of arbitrary things people can do with items on
web pages if those items are managed objects in their own right with
their own identifiers.

 - the identifiers need to be clearly demarcated in a way that
distinguishes which identifier goes with which item on the page (think
browse views or search result lists).

 - if they're not clearly marked there's no way to know which URIs
embedded in the page go with which objects.

 - identifiers themselves can be in any format... URIs preferred.

 - we have a way to do the above, but it's unmicroformatic.  OpenURL
ContextObjects, and by association, COinS, are unfortunately
overcomplex and intended for machine, not human consumption.

 - some of the benefits of using COinS could be accomplished more
easily with a simpler convention for demarcating which items on a page
have which symbolically meaningful content identifiers.  Something as
easy as "class='identifier'" might do it.
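
As a rough sketch of that last point (an illustration only, not an
agreed format): if identifiers were wrapped in elements carrying
class='identifier', a consumer could collect them with a few lines of
Python.  The sample markup and identifier values are made up, and a
real convention would still need to say how an identifier is tied to
its enclosing item.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not affect nesting.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class IdentifierFinder(HTMLParser):
    """Collect the visible text of every element with class 'identifier'."""

    def __init__(self):
        super().__init__()
        self.identifiers = []
        self._depth = 0    # greater than 0 while inside an identifier element
        self._chunks = []  # text pieces of the identifier being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1  # nested markup inside the identifier
        elif "identifier" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.identifiers.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


if __name__ == "__main__":
    page = (
        '<ul>'
        '<li>Some article <span class="identifier">info:doi/10.1000/example</span></li>'
        '<li>A photo <a class="identifier url" href="#">urn:uuid:<b>1234</b></a></li>'
        '</ul>'
    )
    finder = IdentifierFinder()
    finder.feed(page)
    print(finder.identifiers)
```

Unlike a COinS span, the identifier here is the text a human actually
sees on the page, which is the whole point of the simpler convention.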

For the two of you who haven't deleted this and moved on long ago, some links:

    - more details about COinS, with links to OpenURL specs and COinS

    - OAI-PMH

    - some samples of how an identifier microformat could look, and
work, with COinS

    - an example of how OpenURL profiles for articles, books, etc.
might translate to

Thank you for your consideration... really!
