[uf-discuss] Re: Perl microformat parsing
Toby A Inkster
mail at tobyinkster.co.uk
Wed Feb 20 10:25:18 PST 2008
Toby A Inkster wrote:
> For the last week or so, I've been writing the beginnings of what I plan
> to be a GUI browser in Perl (probably using an embedded Gecko rendering
> engine, or WebKit if the Perl bindings get sorted out eventually). The
> browser will have a heavy emphasis on metadata, with information *about*
> the page being displayed prominently beside the page.
>
> Anyway, I've not started work on the GUI and have just been working on
> the (X)HTML parsing and metadata scraping and thought I'd share my
> results with you so far.
An updated version is here:
http://buzzword.org.uk/cognition/cognition-0.1-alpha2.txt
Since not everyone has Perl and all the right CPAN modules installed,
here's a web-based front-end for the parser:
http://buzzword.org.uk/cognition/cognition-0.1-alpha2.pl
It supports the following microformats:
* hCard
* geo (plus extensions: body, reference-frame, altitude)
* adr
* hCalendar
* rel-tag
* rel-license
* XOXO
* figure (experimental, based on current brainstorming)
It supports the official include pattern, plus my own suggested include
pattern (using class names beginning with a hash sign). It supports the
ABBR pattern, plus Andy Mabbett's proposed "data:"-in-title-attribute
pattern (bug fixed on that -- thanks Andy).
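
To illustrate the ABBR pattern mentioned above, here is a minimal sketch
in Python (not Perl, and not taken from the script itself). It assumes
the usual reading of the pattern: when a property class sits on an ABBR
element that has a title attribute, the title is the value; otherwise the
element's text content is used. The class name AbbrValueParser and its
"values" attribute are made up for this example, and the include and
"data:" patterns are not covered.

```python
from html.parser import HTMLParser


class AbbrValueParser(HTMLParser):
    """Collect property values, preferring an ABBR element's title."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.values = {}   # property class name -> extracted value
        self._stack = []   # open elements: (tag, matched classes, text parts)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split()) & self.wanted
        title = attrs.get("title")
        if tag == "abbr" and title is not None:
            # ABBR pattern: the title attribute carries the value.
            for name in classes:
                self.values.setdefault(name, title)
            classes = set()  # do not also collect the visible text
        self._stack.append((tag, classes, []))

    def handle_data(self, data):
        # Text belongs to every open element that is still collecting.
        for _tag, classes, parts in self._stack:
            if classes:
                parts.append(data)

    def handle_endtag(self, tag):
        if tag not in [open_tag for open_tag, _c, _p in self._stack]:
            return  # stray end tag, ignore it
        # Pop back to the matching element (this also closes void elements).
        while self._stack:
            open_tag, classes, parts = self._stack.pop()
            for name in classes:
                self.values.setdefault(name, "".join(parts).strip())
            if open_tag == tag:
                break


parser = AbbrValueParser(["dtstart", "summary"])
parser.feed(
    '<div class="vevent">'
    '<abbr class="dtstart" title="2008-02-20">20 Feb</abbr> '
    '<span class="summary">Release</span></div>'
)
parser.close()
print(parser.values)
```

Only the first value found for each class is kept, which is a
simplification for the sake of a short example.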
It builds up a document structure based on heading levels, and also
includes XOXO lists, figures, and "semantic tables" (any table with either
a summary attribute or <caption> element) in the structure.
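
The heading-based part of that structure can be sketched roughly as
follows. This is a Python illustration under my own assumptions, not the
script's actual code: each heading becomes a node, and a node is nested
under the nearest preceding heading with a lower level number. XOXO
lists, figures and semantic tables are left out, and the names
OutlineParser and titles are hypothetical.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Build a nested outline from H1..H6 elements."""

    def __init__(self):
        super().__init__()
        self.root = {"level": 0, "title": None, "children": []}
        self._path = [self.root]   # chain of currently open sections
        self._current = None       # (level, text parts) while in a heading

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._current = (int(tag[1]), [])

    def handle_data(self, data):
        if self._current is not None:
            self._current[1].append(data)

    def handle_endtag(self, tag):
        if self._current is None or tag != "h%d" % self._current[0]:
            return
        level, parts = self._current
        self._current = None
        node = {"level": level, "title": "".join(parts).strip(),
                "children": []}
        # Close sections at the same or a deeper level, then attach.
        while self._path[-1]["level"] >= level:
            self._path.pop()
        self._path[-1]["children"].append(node)
        self._path.append(node)


def titles(node, depth=0):
    """Return the outline as a list of indented heading titles."""
    lines = []
    for child in node["children"]:
        lines.append("  " * depth + child["title"])
        lines.extend(titles(child, depth + 1))
    return lines


outline = OutlineParser()
outline.feed("<h1>A</h1><p>x</p><h2>B</h2><h3>C</h3><h2>D</h2>")
outline.close()
print("\n".join(titles(outline.root)))
```

A skipped level (say, an H3 directly under an H1) simply nests under the
nearest shallower heading here; a real implementation might prefer to
handle that case differently.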
It will also scrape non-microformat metadata from:
* HTTP headers
* TITLE element
* META elements
* LINK elements
* eRDF <http://research.talis.com/2005/erdf/wiki/Main/RdfInHtml>
* Role <http://www.w3.org/TR/xhtml-role/>
and understands metadata namespaces introduced through RFC 2731 compliant
LINK elements, and xmlns:FOO attributes.
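
As a rough illustration of the namespace handling (again in Python, and
only a sketch under stated assumptions rather than the script's own
logic): RFC 2731 declares a prefix with a LINK element such as
rel="schema.DC", and META names of the form "DC.title" are then expanded
against the declared URI. The class name MetaNamespaceParser is made up
for the example. Prefixes are compared case-insensitively here, which is
my own simplification, partly because Python's HTMLParser lowercases
attribute names such as xmlns:FOO.

```python
from html.parser import HTMLParser


class MetaNamespaceParser(HTMLParser):
    """Collect META name/content pairs and expand declared prefixes."""

    def __init__(self):
        super().__init__()
        self.prefixes = {}   # lowercased prefix -> namespace URI
        self.raw_meta = []   # (name, content) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # xmlns:FOO attributes may appear on any element.
        for key, value in attrs.items():
            if key.startswith("xmlns:") and value:
                self.prefixes[key[len("xmlns:"):].lower()] = value
        if tag == "link":
            # RFC 2731 style: <link rel="schema.DC" href="...">
            for rel in (attrs.get("rel") or "").split():
                if rel.lower().startswith("schema.") and attrs.get("href"):
                    prefix = rel[len("schema."):].lower()
                    self.prefixes[prefix] = attrs["href"]
        elif tag == "meta":
            if attrs.get("name") and attrs.get("content") is not None:
                self.raw_meta.append((attrs["name"], attrs["content"]))

    def resolved(self):
        """Expand PREFIX.term names; leave unprefixed names alone."""
        out = []
        for name, content in self.raw_meta:
            prefix, sep, term = name.partition(".")
            base = self.prefixes.get(prefix.lower()) if sep else None
            out.append((base + term if base else name, content))
        return out


page = (
    '<head>'
    '<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">'
    '<meta name="DC.title" content="Example">'
    '<meta name="generator" content="vi">'
    '</head>'
)
meta = MetaNamespaceParser()
meta.feed(page)
meta.close()
for name, content in meta.resolved():
    print(name, "=", content)
```

Expansion is done after the whole document has been read, so a LINK
declaration that appears after the META element using it still applies.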
If you have Perl on your system, I recommend downloading the Perl script
and installing the modules it needs from CPAN. The script has been tested
and works on Linux and Mac OS X. If you don't have Perl, the web
interface should give you a good idea of its parsing results.
I welcome feedback (either on this list, or directly to my e-mail address)
on any problems you find with its parsing. Feedback supplied as a patch
is especially appreciated!
Ideas for future direction are also welcome.
--
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 22 days, 29 min.]
Bottled Water
http://tobyinkster.co.uk/blog/2008/02/18/bottled-water/