process, [citation] (was Re: [uf-new] announcing the hOCR and hBIB microformats)

Thomas Breuel tmbdev at gmail.com
Tue Apr 3 06:40:58 PDT 2007


> That working format is currently very close to doing everything that a
> bibtex-based format can do. I have code in BibDesk that generates that
> working format and reads it in and translates it to bibtex.


I don't see how either of those statements can be true.  For example, BibTeX
incorporates a lot of knowledge and constraints about what kinds of
documents can be cited and what fields mean for those cases, but Straw
doesn't define that.  Furthermore, Straw ignores the issue of markup (math,
chemistry, mixed scripts/styles, etc.) in citations.  And Straw attempts to
enforce the use of semantic markup for fields like dates. As a result,
different converters might convert BibTeX->Straw and Straw->BibTeX
inconsistently, conversions may not render correctly, and, worse yet,
converters often won't even notice when they're altering citations.  In
practice (and I've seen this a number of times), converters just end up
handling the common cases and leaving the rest to manual cleanup.
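
A minimal sketch of how a round trip can silently alter a citation. The
field names, the toy "simplified" format, and all function names here are
hypothetical and chosen only for illustration; this is neither the actual
Straw definition nor a real BibTeX parser.

```python
# Hypothetical illustration only: KNOWN_FIELDS and the "simplified" format
# are assumptions, not the real Straw or BibTeX definitions.
KNOWN_FIELDS = {"author", "title", "year"}  # fields the toy converter handles


def to_simplified(entry):
    """Convert a BibTeX-like dict to the toy format, keeping only known fields."""
    return {k: v for k, v in entry.items() if k in KNOWN_FIELDS}


def from_simplified(record, entry_type="misc"):
    """Convert back; the entry type was never stored, so a default is guessed."""
    result = dict(record)
    result["type"] = entry_type
    return result


def round_trip_losses(entry):
    """Return {field: (original, restored)} for fields changed by a round trip."""
    restored = from_simplified(to_simplified(entry))
    return {k: (v, restored.get(k)) for k, v in entry.items() if restored.get(k) != v}


entry = {
    "type": "inproceedings",
    "author": "A. Author",
    "title": "Bounds on $O(n \\log n)$ sorting",
    "year": "2007",
    "booktitle": "Proceedings of an Example Workshop",
}
losses = round_trip_losses(entry)
for field, (before, after) in losses.items():
    print(f"{field}: {before!r} -> {after!r}")
```

Neither conversion step raises an error, so the changed entry type and the
dropped booktitle only show up when the result is compared explicitly
against the original.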

The basic problem with Straw is that, at the same time as defining how to
encapsulate citations in HTML, it introduces yet another citation format
with semantics slightly different from those of all previous citation
formats; no citation manager understands Straw semantics and there is no
existing practice.  I think encapsulating BibTeX and other formats in microformats is
a better approach, since existing citation managers already know how to deal
with BibTeX semantics and the problem reduces to one of syntactic and markup
conversion (which is hard enough).
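
A rough sketch of what "encapsulating BibTeX" could look like: the entry
type and field names are carried over unchanged into HTML class names, so
only the syntax changes. The "bibtex-" class prefix and the function name
are assumptions made for this example, not part of any published
microformat.

```python
from html import escape


def bibtex_to_html(entry_type, key, fields):
    """Wrap BibTeX fields in HTML, reusing BibTeX names as class names.

    The "bibtex-" class prefix is a hypothetical convention for this sketch.
    """
    lines = [
        f'<div class="bibtex-entry bibtex-{escape(entry_type)}" id="{escape(key)}">'
    ]
    for name, value in fields.items():
        # Field values are escaped, not interpreted: TeX markup such as math
        # is passed through as text rather than converted.
        lines.append(f'  <span class="bibtex-{escape(name)}">{escape(value)}</span>')
    lines.append("</div>")
    return "\n".join(lines)


html_out = bibtex_to_html(
    "article",
    "example07",
    {"author": "A. Author", "title": "Sets with |A| < |B| & more", "year": "2007"},
)
print(html_out)
```

A consumer that already understands BibTeX can map the class names straight
back to fields. Converting markup inside the values is still an open
problem, which this sketch deliberately leaves untouched.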

In any case, in the short term, we've already independently decided to go
with Dublin Core for the document metadata for the current collection of
documents we're dealing with (and DC already has well-defined embeddings in
HTML), so this issue is less pressing.  But in the long term, when we are
recognizing and extracting citations from OCR'ed papers, it will be
impossible to fully conform to a format like Straw.

I'll add some more comments to the Wiki.

Thomas