[uf-discuss] Re: Perl microformat parsing

Wed Feb 20 16:05:41 PST 2008

Hey Toby,

The parser is failing on a lot of the pages on the microformats wiki,
which seemed like a logical place to point it at for testing 8).

e.g.  not well-formed (invalid token) at line 8, column 2478, byte 3581
at /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/XML/Parser.pm
line 187

This is because most of the wiki pages include named character entities
like &nbsp;.

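Here's a patch to prove that this is the problem, using a quick and dirty
regex fix (the diff runs from the patched file to the original, so the
"<" line is the one added, at line 848):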

848d847
<       $html =~ s/\&nbsp\;//igm;

I tried it on both a simple hCard page
(http://microformats.org/wiki/User:RobManson) and the full hCard page
(http://microformats.org/wiki/hcard, which is veeeeery slow to parse),
and the patch fixes both.

A more thorough fix would probably be based upon the notes in sections
8.3 and 8.4 of the Perl-XML FAQ:
http://perl-xml.sourceforge.net/faq/#not_well_formed
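
As a rough sketch of one such direction (my assumption, not something
lifted from the FAQ or from the cognition script): instead of deleting
the entities, rewrite HTML-only named entities as numeric character
references, which an XML parser accepts without a DTD. The names
numeric_entities, _replace_entity and %codepoint are hypothetical, and
the table is deliberately a tiny illustrative subset.

```perl
use strict;
use warnings;

# Illustrative subset only; a real table would cover every HTML entity.
my %codepoint = (
    nbsp  => 160,
    copy  => 169,
    laquo => 171,
    raquo => 187,
    ndash => 8211,
    mdash => 8212,
);

# The five entities that XML predefines can pass through untouched.
my %xml_builtin = map { $_ => 1 } qw(amp lt gt quot apos);

# Map one entity name to its replacement text.
sub _replace_entity {
    my ($name) = @_;

    # Leave XML built-ins and unknown names alone, so that anything
    # unrecognised still surfaces as a parser error.
    return "&$name;" if $xml_builtin{$name} || !exists $codepoint{$name};

    return sprintf '&#%d;', $codepoint{$name};
}

# Rewrite every "&name;" reference found in the markup.
sub numeric_entities {
    my ($html) = @_;
    $html =~ s/&([A-Za-z][A-Za-z0-9]*);/_replace_entity($1)/ge;
    return $html;
}

print numeric_entities("Rob&nbsp;Manson &amp; co &copy; 2008"), "\n";
# prints: Rob&#160;Manson &amp; co &#169; 2008
```

Unlike the one-line regex above, this keeps the non-breaking spaces in
the parsed text rather than dropping them.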

Hope that's useful 8)

roBman

On Wed, 2008-02-20 at 18:25 +0000, Toby A Inkster wrote:
> Toby A Inkster wrote:
> 
> > For the last week or so, I've been writing the beginnings of what I plan
> > to be a GUI browser in Perl (probably using an embedded Gecko rendering
> > engine, or WebKit if the Perl bindings get sorted out eventually). The
> > browser will have a heavy emphasis on metadata, with information *about*
> > the page being displayed prominently beside the page.
> >
> > Anyway, I've not started work on the GUI and have just been working on
> > the (X)HTML parsing and metadata scraping and thought I'd share my
> > results with you so far.
> 
> An updated version is here:
> 
> 	http://buzzword.org.uk/cognition/cognition-0.1-alpha2.txt
> 
> In recognition that not everyone has Perl installed, and all the right 
> CPAN modules, here's a web-based front-end for the parser:
> 
> 	http://buzzword.org.uk/cognition/cognition-0.1-alpha2.pl
> 
> It supports the following microformats:
> 
> 	* hCard
> 	* geo (plus extensions: body, reference-frame, altitude)
> 	* adr
> 	* hCalendar
> 	* rel-tag
> 	* rel-license
> 	* XOXO
> 	* figure (experimental, based on current brainstorming)
> 
> It supports the official include pattern, plus my own suggested include 
> pattern (using class names beginning with a hash sign). It supports the 
> ABBR pattern, plus Andy Mabbett's proposed "data:"-in-title-attribute 
> pattern (bug fixed on that -- thanks Andy).
> 
> It builds up a document structure based on heading levels, and also 
> includes XOXO lists, figures, and "semantic tables" (any table with either 
> a summary attribute or <caption> element) in the structure.
> 
> It will also scrape non-microformat metadata from:
> 
> 	* HTTP headers
> 	* TITLE element
> 	* META elements
> 	* LINK elements
> 	* eRDF <http://research.talis.com/2005/erdf/wiki/Main/RdfInHtml>
> 	* Role <http://www.w3.org/TR/xhtml-role/>
> 
> and understands metadata namespaces introduced through RFC 2731 compliant 
> LINK elements, and xmlns:FOO attributes.
> 
> If you have Perl on your system, I recommend downloading the Perl script 
> and installing the needed modules from CPAN. The script is tested and 
> working on Linux and Mac OS X. If you don't have Perl, then the web 
> interface should give you a good idea of its parsing results.
> 
> I welcome feedback (either on this list, or directly to my e-mail address) 
> on any problems you find with its parsing. I especially appreciate your 
> feedback supplied as a patch!
> 
> Also any ideas for future direction are welcome.
>