[uf-discuss] Re: Perl microformat parsing

Toby A Inkster mail at tobyinkster.co.uk
Wed Feb 20 10:25:18 PST 2008

Toby A Inkster wrote:

> For the last week or so, I've been writing the beginnings of what I plan
> to be a GUI browser in Perl (probably using an embedded Gecko rendering
> engine, or WebKit if the Perl bindings get sorted out eventually). The
> browser will have a heavy emphasis on metadata, with information *about*
> the page being displayed prominently beside the page.
>
> Anyway, I've not started work on the GUI and have just been working on
> the (X)HTML parsing and metadata scraping and thought I'd share my
> results with you so far.

An updated version is here:


Since not everyone has Perl and all the right CPAN modules installed, 
here's a web-based front-end for the parser:


It supports the following microformats:

	* hCard
	* geo (plus extensions: body, reference-frame, altitude)
	* adr
	* hCalendar
	* rel-tag
	* rel-license
	* figure (experimental, based on current brainstorming)
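
To illustrate the two simplest formats in that list, rel-tag and 
rel-license can be read from links alone. What follows is a rough Python 
sketch, not the parser's actual Perl code; the class name RelLinkScraper 
and the sample markup are made up for the example. It takes the tag name 
from the last path segment of the href, as the rel-tag spec describes.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class RelLinkScraper(HTMLParser):
    """Collect rel-tag names and rel-license URLs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # rel holds a space-separated list of link types.
        rels = (attrs.get("rel") or "").lower().split()
        if "tag" in rels:
            # rel-tag: the tag is the last path segment of the URL.
            path = urlsplit(href).path.rstrip("/")
            self.tags.append(unquote(path.rsplit("/", 1)[-1]))
        if "license" in rels:
            self.licenses.append(href)


scraper = RelLinkScraper()
scraper.feed(
    '<a rel="tag" href="http://example.org/tags/micro%20formats">mf</a>'
    '<a rel="license" href="http://example.org/licenses/by/">CC</a>'
)
print(scraper.tags, scraper.licenses)
```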

It supports the official include pattern, plus my own suggested include 
pattern (using class names beginning with a hash sign). It supports the 
ABBR pattern, plus Andy Mabbett's proposed "data:"-in-title-attribute 
pattern (bug fixed on that -- thanks Andy).
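
To make the ABBR pattern concrete: when a property sits on an ABBR element 
that has a title attribute, the title supplies the machine-readable value 
instead of the visible text. The Python sketch below shows only that rule. 
It is an illustration rather than the script's Perl code, PropertyScraper 
is a hypothetical name, and it handles neither of the include patterns nor 
the "data:" proposal.

```python
from html.parser import HTMLParser


class PropertyScraper(HTMLParser):
    """Read property values by class name, honouring the ABBR pattern."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)  # class names to look for
        self.values = {}           # property name -> value
        self._stack = []           # open elements: (tag, props, text parts)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        props = [c for c in classes if c in self.wanted]
        if tag == "abbr" and attrs.get("title") is not None:
            # ABBR pattern: the title attribute carries the value.
            for prop in props:
                self.values[prop] = attrs["title"]
            props = []  # nothing left to read from the text content
        self._stack.append((tag, props, []))

    def handle_data(self, data):
        # Text counts towards every enclosing property element.
        for _tag, props, parts in self._stack:
            if props:
                parts.append(data)

    def handle_endtag(self, tag):
        # Find the innermost open element with this name; ignore strays.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                break
        else:
            return
        # Close it, along with any unclosed elements inside it.
        for _tag, props, parts in reversed(self._stack[i:]):
            for prop in props:
                self.values[prop] = "".join(parts).strip()
        del self._stack[i:]


scraper = PropertyScraper({"dtstart", "summary"})
scraper.feed(
    '<div class="vevent">'
    '<abbr class="dtstart" title="2008-02-20">20 February</abbr>: '
    '<span class="summary">Parser <span>update</span></span>'
    '</div>'
)
print(scraper.values)
```

Here dtstart comes from the title attribute, while summary falls back to 
the element's text.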

It builds up a document structure based on heading levels, and also 
includes XOXO lists, figures, and "semantic tables" (any table with either 
a summary attribute or <caption> element) in the structure.
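
One way to picture a structure derived from heading levels: each heading 
becomes a node nested under the nearest preceding heading of a lower 
level. The Python sketch below shows that general idea with a stack. It is 
an assumption about the technique, not a description of how the script 
does it, and build_outline is a hypothetical name. It takes (level, title) 
pairs as if they had already been pulled from the H1 to H6 elements, and 
leaves out lists, figures and tables.

```python
def build_outline(headings):
    """Nest (level, title) pairs into a tree of dicts by heading level."""
    root = {"level": 0, "title": None, "children": []}
    stack = [root]  # path from the root to the most recent heading
    for level, title in headings:
        node = {"level": level, "title": title, "children": []}
        # Pop until the top of the stack outranks the new heading.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


outline = build_outline([
    (1, "Parser"),
    (2, "Microformats"),
    (3, "hCard"),
    (2, "Other metadata"),
])
```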

It will also scrape non-microformat metadata from:

	* HTTP headers
	* TITLE element
	* META elements
	* LINK elements
	* eRDF <http://research.talis.com/2005/erdf/wiki/Main/RdfInHtml>
	* Role <http://www.w3.org/TR/xhtml-role/>

and understands metadata namespaces introduced through RFC 2731-compliant 
LINK elements and through xmlns:FOO attributes.
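
As a rough illustration of namespace handling: RFC 2731 declares a prefix 
with a LINK element such as rel="schema.DC", and a META name like DC.title 
then refers to a property in that namespace. The Python sketch below 
collects prefixes from both kinds of declaration and expands META names 
into full URIs. It is not the script's Perl code. NamespaceScraper, the 
example.org URIs, the case-insensitive prefix matching and the acceptance 
of both "." and ":" as separators are all assumptions made for the example.

```python
from html.parser import HTMLParser


class NamespaceScraper(HTMLParser):
    """Map metadata prefixes to namespace URIs and expand META names."""

    def __init__(self):
        super().__init__()
        self.prefixes = {}  # lowercased prefix -> namespace URI
        self.metadata = []  # (name as written, content)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # xmlns:FOO="uri" on any element declares the prefix FOO.
        # HTMLParser lowercases attribute names, so prefixes are
        # compared case-insensitively throughout this sketch.
        for name, value in attrs.items():
            if name.startswith("xmlns:") and value:
                self.prefixes[name[len("xmlns:"):]] = value
        # RFC 2731 style: <link rel="schema.FOO" href="uri">.
        if tag == "link" and attrs.get("href"):
            for rel in (attrs.get("rel") or "").split():
                if rel.lower().startswith("schema."):
                    prefix = rel[len("schema."):].lower()
                    self.prefixes[prefix] = attrs["href"]
        if tag == "meta" and attrs.get("name"):
            if attrs.get("content") is not None:
                self.metadata.append((attrs["name"], attrs["content"]))

    def expanded(self):
        """Return (property URI or bare name, content) pairs."""
        result = []
        for name, content in self.metadata:
            for sep in (".", ":"):
                prefix, found, local = name.partition(sep)
                uri = self.prefixes.get(prefix.lower())
                if found and uri:
                    name = uri + local
                    break
            result.append((name, content))
        return result


scraper = NamespaceScraper()
scraper.feed(
    '<html xmlns:cc="http://example.org/ns/cc#"><head>'
    '<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">'
    '<meta name="DC.title" content="Perl microformat parsing">'
    '<meta name="cc:license" content="http://example.org/licenses/by/">'
    '<meta name="generator" content="hand">'
    '</head></html>'
)
for name, content in scraper.expanded():
    print(name, "=", content)
```

Expansion is deferred until after parsing so that a declaration appearing 
later in the document than the META element still applies.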

If you have Perl on your system, I recommend downloading the Perl script 
and installing the needed modules from CPAN. The script is tested and 
working on Linux and Mac OS X. If you don't have Perl, then the web 
interface should give you a good idea of its parsing results.

I welcome feedback (either on this list, or directly to my e-mail address) 
on any problems you find with its parsing. Feedback supplied as a patch is 
especially appreciated!

Ideas for future direction are also welcome.

Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux, up 22 days, 29 min.]
