[uf-discuss] hCite elevator pitch and my bibliography generator

Henri Sivonen hsivonen at iki.fi
Fri Mar 23 16:48:55 PST 2007


On Mar 23, 2007, at 14:22, Paul Wilkins wrote:

> Henri Sivonen wrote:
>> On Mar 10, 2007, at 23:10, Paul Wilkins wrote:
>>> You are using the BibTeX format, which is covered in the
>>> citation formats http://microformats.org/wiki/citation-formats
>> Sure, but considering that I share my .bib, should I expect people
>> to want to scrape my (X)HTML-formatted bibliography?
>
> If the .bib is used as the lone source for the XHTML, I suspect it  
> would be easier to scrape the .bib file.

It is the lone source.

>>> The citation microformat is a work in progress at this stage, so
>>> it's not mature enough for programs to extract information from it,
>> I guess this means that I shouldn't try to support hCite on the
>> generator side in my thesis considering that the document should
>> go final in the first week of April.
>
> Even though it goes final then, does that prevent you from later on  
> adding markup which doesn't affect the text, yet makes it easier  
> for tools to scrape through the information?

There's no technical barrier to updating the file, but as a matter of  
archival principle, it seems wrong to tamper with the dated file  
later on. Tampering with it could lower the confidence of readers in  
the stability of the file as a version that corresponds exactly to  
the official paper version in the university department library.

>> Would it be of any use to anyone if I wrapped the name of each
>> author/editor in a <span class='fn'> if I otherwise leave my
>> markup the way it is now?
>
> A formatted name is quite a restricted format, and if the formatted  
> name doesn't follow a certain prescribed format, it is considered  
> to be invalid and isn't used.

What about class='n'?

> Currently the BibTeX is as follows
>
> @Misc{AXML,
>   editor = {Tim Bray and Jean Paoli and C.M. Sperberg-McQueen},
>   title = {The Annotated XML 1.0 Specification},
>   year = 	 {1998},
>   publisher = {O'Reilly Media, Inc.},
>   refdate = {2007-03-04},
>   url = {http://www.xml.com/pub/a/axml/axmlintro.html}
> }
>
> From which you are wanting to create the following kind of data.
>
> [AXML]
>     The Annotated XML 1.0 Specification. Tim Bray, Jean Paoli and
> C.M. Sperberg-McQueen, editors. O’Reilly Media, Inc., 1998.
> http://www.xml.com/pub/a/axml/axmlintro.html (referenced: 2007-03-04)
>
> The editor section alone will be interesting to mark up, because the  
> citation will have to allow multiple editors, in which case both  
> the BibTeX and the microformat will have to be created from a  
> parent source, so that the microformat can gain the name-based  
> information in the format required, while still allowing that  
> information through to become the BibTeX file.

The current markup is:
<dt id="ref-AXML">[AXML]</dt>
<dd><p><a href="http://www.xml.com/pub/a/axml/axmlintro.html"><cite
class="title">The Annotated XML 1.0 Specification</cite></a>. <span
class="editor">Tim Bray, Jean Paoli and C.M. Sperberg-McQueen</span>,
editors. <span class="publisher">O’Reilly Media, Inc.</span>, <span
class="year">1998</span>. <span class="urlwrap"><a
href="http://www.xml.com/pub/a/axml/axmlintro.html"
class="url">http://www.xml.com/pub/a/axml/axmlintro.html</a>
(referenced: <span class="refdate">2007-03-04</span>)</span></p></dd>

So to extract the editors from <span class="editor">Tim Bray, Jean  
Paoli and C.M. Sperberg-McQueen</span>, one would need to know that  
it is OK to split on ", " and " and ". I could wrap each name in a  
span with a class. But what class?
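
For illustration, here is a minimal Python sketch of what a consumer would have to do with the current markup, assuming it already knows the two separators. The function name split_names is hypothetical and not part of any hCite tool; the optional "and" after a comma is an extra assumption to tolerate a serial comma.

```python
import re

def split_names(text):
    """Split a formatted name list on ", " and " and ".

    Assumption: no individual name contains ", " or " and ",
    which is exactly what a scraper cannot know in general.
    """
    parts = re.split(r",\s+(?:and\s+)?|\s+and\s+", text.strip())
    return [p for p in parts if p]

editors = split_names("Tim Bray, Jean Paoli and C.M. Sperberg-McQueen")
print(editors)
```

The same split applied to a string such as "O’Reilly Media, Inc." would cut it in two, which is why the consumer has to know in advance that the separators are safe for a given element.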

>>> The benefits are that people visiting your content with next
>>> generation tools will be able to easily extract and use the
>>> information in more interesting and useful ways.
>> So basically, my effort would not be about catering to specific   
>> realistic foreseeable use cases. Instead, it would be about  
>> putting  data out there in case someone figures out a use case  
>> later on.
>
> It may be more useful to provide the ISBN number for the book. Then  
> the problems left to be solved become smaller and easier to handle.

Sorry, but I don't follow. What's "the book"?

>> Somehow, I was under the impression that hCite required
>> bibliography items as <li>s instead of <dt>/<dd> pairs (which is
>> what I use and what W3C and WHATWG specs use).
>
> I'm sure that design patterns can be created to accommodate such a  
> scheme.
>
>> What I'm trying to say is that I think hCite should allow names to
>> be marked up as formatted names, tossing the deformatting problem
>> to the consumer. After all, one of the most popular bibliography
>> data formats, BibTeX, stores formatted names.
>
> Currently the formatted names are accepted in the following formats
>
> given-name (space) family-name
> family-name (comma) given-name
> family-name (comma) given-name-first-initial
> family-name (space) given-name-first-initial (optional period)

In the .bib, there are names that don't follow those formats. For  
example:
Michael I. Schwartzbach
C. M. Sperberg-McQueen
Arnaud Le Hors
Simon St. Laurent
Håkon Wium Lie
Geert Jan Bex
Jan Van den Bussche
Henrik Frystyk Nielsen
Roy Thomas Fielding

I'd rather not develop heuristics on my end for properly guessing the  
semantics of the middle tokens. Especially when it is uncertain if  
anyone will ever scrape my bibliography with a tool that both needs  
and understands semantic encoding of the different parts of the name.
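
To make the mismatch concrete, here is a rough Python sketch that checks the names above against the four accepted formats quoted earlier. It assumes one strict reading of that list, namely that each name part is a single token without spaces; the hCite drafts may intend something looser. TOKEN, INITIAL, ACCEPTED and fits are names made up for this example.

```python
import re

TOKEN = r"[^\s,]+"      # assumption: a name part has no spaces or commas
INITIAL = r"[^\W\d_]"   # a single letter

# One regular expression per accepted format, under the strict reading.
ACCEPTED = {
    "given family": re.compile(rf"{TOKEN} {TOKEN}"),
    "family, given": re.compile(rf"{TOKEN}, {TOKEN}"),
    "family, initial": re.compile(rf"{TOKEN}, {INITIAL}"),
    "family initial[.]": re.compile(rf"{TOKEN} {INITIAL}\.?"),
}

def fits(name):
    """Return True if the whole name matches any accepted format."""
    return any(p.fullmatch(name) for p in ACCEPTED.values())

NAMES = [
    "Michael I. Schwartzbach", "C. M. Sperberg-McQueen", "Arnaud Le Hors",
    "Simon St. Laurent", "Håkon Wium Lie", "Geert Jan Bex",
    "Jan Van den Bussche", "Henrik Frystyk Nielsen", "Roy Thomas Fielding",
]

unmatched = [n for n in NAMES if not fits(n)]
print(unmatched)
```

Under that reading every name in the list has at least three space-separated parts, so none of them fits, while a two-part name such as "Tim Bray" does.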

> How much granularity does BibTeX allow for when storing the  
> formatted names for Editors?

I don't know what's officially allowed. My generator uses  
UTF-8-encoded .bib, which is already stretching things a bit.

In general, you just write names in .bib in the usual little-endian  
European format and BibTeX does some magic. I guess that in the real  
BibTeX, there are heuristics for Dutch surnames (van der) and the like.

I am using a format where the given name comes first, the family  
name comes last, and whatever middle material appears in the name of  
the author on the work itself goes in between. C. M.  
Sperberg-McQueen, mentioned above, is an exception, because the first  
name is deliberately not provided in full. The output format I use is  
intentionally lossless, so the given name isn't truncated and the  
surname isn't reordered.
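
As a sketch of what lossless means here (not the actual generator code), the following Python function turns a BibTeX-style name field, where names are separated by " and ", into the display form shown in the [AXML] example. The function name format_name_list is hypothetical, and the sketch assumes no name itself contains the word "and" surrounded by spaces.

```python
import re

def format_name_list(field):
    """Join BibTeX " and "-separated names for display.

    Each name is passed through untouched: no truncation to
    initials and no reordering of given and family names.
    """
    names = [n.strip() for n in re.split(r"\s+and\s+", field.strip())]
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + " and " + names[-1]

print(format_name_list("Tim Bray and Jean Paoli and C.M. Sperberg-McQueen"))
```

Because each name is copied through as written, whatever appears in the .bib is exactly what appears in the (X)HTML, and the only transformation is the choice of separators.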

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/




