[uf-discuss] Perl microformat parsing
Toby A Inkster
mail at tobyinkster.co.uk
Fri Feb 15 09:42:10 PST 2008
For the last week or so, I've been writing the beginnings of what I plan
to be a GUI browser in Perl (probably using an embedded Gecko rendering
engine, or WebKit if the Perl bindings get sorted out eventually). The
browser will have a heavy emphasis on metadata, with information *about*
the page being displayed prominently beside the page.
Anyway, I've not started work on the GUI yet and have just been working on
the (X)HTML parsing and metadata scraping, and thought I'd share my results
so far with you.
The code is capable of parsing the following Microformats:
* hCard (except categories)
* hCalendar (except categories)
* (include pattern)
* (abbr pattern)
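As a rough illustration of the abbr pattern listed above: when a property sits on an ABBR element, the machine-readable value comes from its title attribute rather than its visible text. The sketch below is in Python for brevity and is not taken from the Perl code announced here. The `PROPERTIES` set is a small subset picked for illustration, the names `AbbrPatternParser` and `extract_properties` are hypothetical, and it assumes well-formed markup in which every non-void element is closed.

```python
from html.parser import HTMLParser

# Illustrative subset of hCard/hCalendar property class names (not exhaustive).
PROPERTIES = {"fn", "org", "tel", "summary", "dtstart", "dtend", "location"}

# Void elements never get an end tag, so they are kept off the element stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class AbbrPatternParser(HTMLParser):
    """Collect (property, value) pairs, honouring the abbr pattern."""

    def __init__(self):
        super().__init__()
        self.found = []    # (property, value) pairs, emitted as elements close
        self._stack = []   # one (property names, text parts, title) per open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        names = [c for c in (attrs.get("class") or "").split() if c in PROPERTIES]
        # Only ABBR elements contribute their title attribute as the value.
        title = attrs.get("title") if tag == "abbr" else None
        self._stack.append((names, [], title))

    def handle_data(self, data):
        # Text belongs to every open element that carries a property class.
        for names, parts, _title in self._stack:
            if names:
                parts.append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self._stack:
            return
        names, parts, title = self._stack.pop()
        text = " ".join("".join(parts).split())   # collapse whitespace
        value = title if title is not None else text
        for name in names:
            self.found.append((name, value))


def extract_properties(html):
    """Return the (property, value) pairs found in an HTML string."""
    parser = AbbrPatternParser()
    parser.feed(html)
    parser.close()
    return parser.found
```

Feeding it `<abbr class="dtstart" title="2008-02-15">15 Feb</abbr>` yields the ISO date from the title, while a SPAN with the same class would yield its text content instead.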
Additionally it supports certain proposed extensions to microformats: my
proposed alternative include pattern; Andy Mabbett's proposed "data:"
prefix for abbr titles; and the "body" and "reference-frame" components to
It will also scrape non-microformat metadata from:
* HTTP headers
* TITLE element
* META elements
* LINK elements
* eRDF <http://research.talis.com/2005/erdf/wiki/Main/RdfInHtml>
* Role <http://www.w3.org/TR/xhtml-role/>
and understands metadata namespaces introduced through RFC 2731 compliant
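For readers unfamiliar with RFC 2731: it describes declaring a metadata prefix with a LINK element such as `<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">` and then using that prefix in META names such as `DC.creator`. The following is a minimal Python sketch of scraping TITLE, META and LINK elements and expanding such prefixes. It is an illustration only, not the Perl implementation; the names `HeadMetadataParser` and `scrape_head` are hypothetical, and joining the namespace URI to the local name by simple concatenation is an assumption.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect TITLE text, META name/content pairs and LINK rel/href pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.schemas = {}   # prefix -> namespace URI, from rel="schema.PREFIX"
        self.meta = []      # (name, content) pairs
        self.links = []     # (rel, href) pairs, excluding schema declarations
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            if attrs.get("name") and attrs.get("content") is not None:
                self.meta.append((attrs["name"], attrs["content"]))
        elif tag == "link":
            if attrs.get("rel") and attrs.get("href"):
                # rel is a space-separated list of link types.
                for rel in attrs["rel"].split():
                    if rel.lower().startswith("schema."):
                        self.schemas[rel[len("schema."):]] = attrs["href"]
                    else:
                        self.links.append((rel, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def scrape_head(html):
    """Return title, META pairs with declared prefixes expanded, and links."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    resolved = []
    for name, content in parser.meta:
        prefix, dot, local = name.partition(".")
        # Expand "DC.creator" only when "DC" was declared via a schema link.
        if dot and prefix in parser.schemas:
            name = parser.schemas[prefix] + local
        resolved.append((name, content))
    return {"title": parser.title.strip(), "meta": resolved, "links": parser.links}
```

Prefixes are resolved only after the whole document has been read, so a schema declaration that appears after the META element using it is still honoured.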
Currently it's terminal-based: it takes a single URL specified as a command
line parameter, requests and parses the document, and dumps out the parsed
data in a kinda readable format.
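The overall flow described above (one URL on the command line, fetch, dump) can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the tool's actual code: `fetch`, `format_report` and `main` are hypothetical names, the report layout is invented, and only the HTTP headers and body length are dumped here rather than fully parsed metadata.

```python
import sys
import urllib.request


def fetch(url):
    """Request the document; return (list of header pairs, decoded body)."""
    with urllib.request.urlopen(url) as response:
        headers = response.headers.items()   # HTTP headers as (name, value) pairs
        charset = response.headers.get_content_charset() or "utf-8"
        body = response.read().decode(charset, errors="replace")
    return headers, body


def format_report(url, headers, body):
    """Build a plain-text dump of what was retrieved."""
    lines = ["URL: " + url, "HTTP headers:"]
    for name, value in headers:
        lines.append("  {}: {}".format(name, value))
    lines.append("Body: {} characters".format(len(body)))
    return "\n".join(lines)


def main(argv):
    """Expect exactly one URL argument; return a process exit status."""
    if len(argv) != 2:
        print("usage: {} URL".format(argv[0]), file=sys.stderr)
        return 2
    headers, body = fetch(argv[1])
    print(format_report(argv[1], headers, body))
    return 0
```

To use it as a script, call `sys.exit(main(sys.argv))` at the bottom of the file. Keeping the formatting separate from the fetching means the dump can be checked without any network access.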
Download alpha 1 here:
It needs a bunch of Perl modules, but they're all in CPAN.
I welcome feedback (either on this list, or directly to my e-mail address)
on any problems you find with its parsing. I especially appreciate your
feedback supplied as a patch!
Also any ideas for future direction are welcome.
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 188.8.131.52-mm-desktop-9mdvsmp, up 16 days, 23:44.]
Mince & Dumplings