[uf-new] announcing the hOCR and hBIB microformats

Thomas Breuel tmbdev at gmail.com
Tue Mar 27 23:25:36 PST 2007

We're currently developing a new open source OCR system, with a focus on
digital library applications (www.ocropus.org).  As part of this, we needed
formats for representing both OCR output and bibliographic metadata, and we
have defined two new microformats for this purpose: hOCR and hBIB.

hOCR is a format for representing OCR output, including layout information,
character confidences, bounding boxes, and style information. It embeds this
information invisibly in standard HTML. By building on standard HTML, it
automatically inherits well-defined support for most scripts, languages, and
common layout options. Furthermore, unlike previous OCR formats, the
recognized text and OCR-related information co-exist in the same file and
survive editing and manipulation. hOCR markup is independent of the
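
To illustrate how OCR information can ride along invisibly in ordinary HTML,
here is a minimal sketch in Python. It assumes, purely for illustration, that
bounding boxes are stored in the title attribute as "bbox x0 y0 x1 y1" and that
elements carry class names such as ocr_page and ocr_line; the sample document,
the class names, and the helper names (BBoxCollector, extract_boxes) are
hypothetical and are not taken from this announcement.

```python
from html.parser import HTMLParser

# Hypothetical sample page. The class names and the "bbox x0 y0 x1 y1"
# title syntax are assumptions made for this sketch.
SAMPLE = """<html><body>
<div class="ocr_page" title="bbox 0 0 2480 3508">
<span class="ocr_line" title="bbox 210 300 1800 352">We are developing a new OCR system</span>
<span class="ocr_line" title="bbox 210 360 1650 412">for digital library applications.</span>
</div>
</body></html>"""


class BBoxCollector(HTMLParser):
    """Collects (class, bbox) pairs from title attributes and the visible text."""

    def __init__(self):
        super().__init__()
        self.boxes = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        title = attrs.get("title") or ""
        # Allow several ";"-separated properties in one title attribute.
        for prop in title.split(";"):
            parts = prop.split()
            if len(parts) == 5 and parts[0] == "bbox":
                coords = tuple(int(p) for p in parts[1:])
                self.boxes.append((attrs.get("class"), coords))

    def handle_data(self, data):
        # The recognized text is ordinary HTML text content.
        if data.strip():
            self.text.append(data.strip())


def extract_boxes(html_text):
    """Return (boxes, text_lines) for an HTML string under the assumptions above."""
    collector = BBoxCollector()
    collector.feed(html_text)
    return collector.boxes, collector.text


if __name__ == "__main__":
    boxes, lines = extract_boxes(SAMPLE)
    for cls, bbox in boxes:
        print(cls, bbox)
    print(lines)
```

Because the layout data sits in attributes, a browser shows only the
recognized text, while a tool like the sketch above can still recover the
boxes from the same file.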

The hBIB format is a microformat that makes it easy to indicate both where a
document has been published and which references are stored within the
document (e.g., for reference lists).  It is a straightforward embedding of
BibTeX into HTML and should also be useful for making reference lists
available and embedding citation information into the output of tools like

