[uf-discuss] Perl microformat parsing

Toby A Inkster mail at tobyinkster.co.uk
Fri Feb 15 09:42:10 PST 2008

Dear all,

For the last week or so, I've been writing the beginnings of what I plan 
to be a GUI browser in Perl (probably using an embedded Gecko rendering 
engine, or WebKit if the Perl bindings get sorted out eventually). The 
browser will have a heavy emphasis on metadata, with information *about* 
the page being displayed prominently beside the page.

Anyway, I've not started work on the GUI yet; so far I've just been working 
on the (X)HTML parsing and metadata scraping, and thought I'd share my 
results so far with you.

The code is capable of parsing the following Microformats:

	* hCard (except categories)
	* geo
	* adr
	* hCalendar (except categories)
	* (include pattern)
	* (abbr pattern)
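
As a rough illustration of how the abbr pattern interacts with a simple 
format like geo, here is a minimal Python sketch. It is not the Perl code 
announced in this message; the names GeoParser and parse_geo are made up 
for the example, and it assumes well-nested markup with no error recovery. 
When a property sits on an ABBR element with a title attribute, the title 
is taken as the value; otherwise the element's text is used.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not affect nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class GeoParser(HTMLParser):
    """Hypothetical helper: collects latitude/longitude from class="geo"."""

    def __init__(self):
        super().__init__()
        self.results = []   # one dict per geo element found
        self._stack = []    # role of each open element: "geo", a property, or None
        self._geo = None    # dict being filled while inside a geo element
        self._text = None   # text fragments of the property being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        role = None
        if self._geo is None and "geo" in classes:
            self._geo = {}
            role = "geo"
        elif self._geo is not None and self._text is None:
            for name in ("latitude", "longitude"):
                if name in classes:
                    role = name
                    if tag == "abbr" and attrs.get("title") is not None:
                        # abbr pattern: the title attribute carries the value
                        self._geo[name] = attrs["title"].strip()
                    else:
                        # otherwise start collecting the element's text
                        self._text = []
                    break
        self._stack.append(role)

    def handle_endtag(self, tag):
        if tag in VOID or not self._stack:
            return
        role = self._stack.pop()
        if role == "geo":
            self.results.append(self._geo)
            self._geo = None
        elif role is not None and self._text is not None:
            self._geo[role] = "".join(self._text).strip()
            self._text = None

    def handle_data(self, data):
        if self._text is not None:
            self._text.append(data)


def parse_geo(html):
    """Return a list of dicts with whatever geo properties were found."""
    parser = GeoParser()
    parser.feed(html)
    parser.close()
    return parser.results
```

Calling parse_geo() on a fragment where the latitude is an ABBR with a 
title and the longitude is a plain SPAN should yield one dict holding the 
machine-readable title for the former and the visible text for the latter.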

Additionally it supports certain proposed extensions to microformats: my 
proposed alternative include pattern; Andy Mabbett's proposed "data:" 
prefix for abbr titles; and the "body" and "reference-frame" components to 

It will also scrape non-microformat metadata from:

	* HTTP headers
	* TITLE element
	* META elements
	* LINK elements
	* eRDF <http://research.talis.com/2005/erdf/wiki/Main/RdfInHtml>
	* Role <http://www.w3.org/TR/xhtml-role/>

and understands metadata namespaces introduced through RFC 2731 compliant 
LINK elements.
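
To make the namespace handling concrete, here is a small Python sketch of 
scraping TITLE, META and LINK elements and resolving prefixed META names 
against RFC 2731 style schema links (rel="schema.PREFIX"). It is only an 
illustration under stated assumptions, not the announced Perl code: the 
names HeadMetadataParser and scrape_head are invented for the example, and 
joining the namespace URI directly to the local name is one possible way 
to represent a resolved property.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Hypothetical helper: gathers TITLE, META and LINK data from a page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.meta = []       # (name, content) pairs as written in the page
        self.links = []      # (rel, href) pairs, one per rel keyword
        self.schemas = {}    # prefix -> namespace URI, from rel="schema.PREFIX"
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name, content = attrs.get("name"), attrs.get("content")
            if name and content is not None:
                self.meta.append((name, content))
        elif tag == "link":
            href = attrs.get("href")
            if not href:
                return
            # rel is a space-separated list of keywords
            for rel in (attrs.get("rel") or "").split():
                if rel.lower().startswith("schema."):
                    self.schemas[rel[len("schema."):]] = href
                else:
                    self.links.append((rel, href))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def scrape_head(html):
    """Return title, links, and META pairs with known prefixes expanded."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    resolved = []
    for name, content in parser.meta:
        prefix, dot, local = name.partition(".")
        if dot and prefix in parser.schemas:
            # Assumption: represent the property as namespace URI + local name
            resolved.append((parser.schemas[prefix] + local, content))
        else:
            resolved.append((name, content))
    title = "".join(parser.title_parts).strip() or None
    return {"title": title, "meta": resolved, "links": parser.links}
```

Because prefixes are resolved only after the whole document has been read, 
a schema LINK that appears after the META elements using it is still 
honoured in this sketch.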

Currently it's terminal-based: it takes a single URL specified as a 
command-line parameter, requests and parses the document, and dumps out the 
parsed data in a kinda readable format.

Download alpha 1 here:

It needs a bunch of Perl modules, but they're all in CPAN.

I welcome feedback (either on this list, or directly to my e-mail address) 
on any problems you find with its parsing. Feedback supplied as a patch is 
especially appreciated!

Any ideas for future direction are also welcome.

Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux, up 16 days, 23:44.]

                             Mince & Dumplings
