[uf-discuss] Re: Perl microformat parsing
Rob Manson
roBman at MobileOnlineBusiness.com.au
Wed Feb 20 16:05:41 PST 2008
Hey Toby,
the parser is failing on a lot of the pages on the microformats wiki,
which seemed like a logical place to point it at for testing 8).
e.g. not well-formed (invalid token) at line 8, column 2478, byte 3581
at /usr/lib/perl5/site_perl/5.8.8/i386-linux-thread-multi/XML/Parser.pm
line 187
This is because most of the wiki pages include HTML entities like &nbsp;.
Here's a patch to prove that this is the problem using a quick and dirty
regex fix:
848d847
< $html =~ s/\&nbsp\;//igm;
I tried it on both a simple hcard like
http://microformats.org/wiki/User:RobManson and the full hcard page
(which is veeeeery slow to parse) http://microformats.org/wiki/hcard and
the patch fixes it.
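
A slightly gentler variant of the same idea, as a rough sketch rather than
a tested patch against the script: instead of deleting the entity, rewrite
it as a numeric character reference, which an XML parser accepts without
a DTD. The helper name numeric_entities and the small %ENTITY table are
made up for illustration; a real fix would need the full HTML entity list.

```perl
use strict;
use warnings;

# Hypothetical lookup table: HTML named entity => Unicode code point.
# Only a few entries are shown; the real HTML list is much longer.
my %ENTITY = (
    nbsp   => 160,
    copy   => 169,
    eacute => 233,
);

# Replace known named entities with numeric character references.
# Anything not in the table (including the XML built-ins such as &amp;)
# is left exactly as it was.
sub numeric_entities {
    my ($html) = @_;
    $html =~ s{&([A-Za-z][A-Za-z0-9]*);}
              {exists $ENTITY{$1} ? sprintf('&#%d;', $ENTITY{$1}) : "&$1;"}ge;
    return $html;
}

print numeric_entities('<p>a&nbsp;b &amp; c</p>'), "\n";
# prints: <p>a&#160;b &amp; c</p>
```

This keeps the non-breaking space in the parsed text instead of silently
dropping it, which may matter for values scraped out of an hcard.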
A more thorough fix would probably be based upon the notes in 8.3 and
8.4 on the following link.
http://perl-xml.sourceforge.net/faq/#not_well_formed
Hope that's useful 8)
roBman
On Wed, 2008-02-20 at 18:25 +0000, Toby A Inkster wrote:
> Toby A Inkster wrote:
>
> > For the last week or so, I've been writing the beginnings of what I plan
> > to be a GUI browser in Perl (probably using an embedded Gecko rendering
> > engine, or WebKit if the Perl bindings get sorted out eventually). The
> > browser will have a heavy emphasis on metadata, with information *about*
> > the page being displayed prominently beside the page.
> >
> > Anyway, I've not started work on the GUI and have just been working on
> > the (X)HTML parsing and metadata scraping and thought I'd share my
> > results with you so far.
>
> An updated version is here:
>
> http://buzzword.org.uk/cognition/cognition-0.1-alpha2.txt
>
> In recognition that not everyone has Perl installed, and all the right
> CPAN modules, here's a web-based front-end for the parser:
>
> http://buzzword.org.uk/cognition/cognition-0.1-alpha2.pl
>
> It supports the following microformats:
>
> * hCard
> * geo (plus extensions: body, reference-frame, altitude)
> * adr
> * hCalendar
> * rel-tag
> * rel-license
> * XOXO
> * figure (experimental, based on current brainstorming)
>
> It supports the official include pattern, plus my own suggested include
> pattern (using class names beginning with a hash sign). It supports the
> ABBR pattern, plus Andy Mabbett's proposed "data:"-in-title-attribute
> pattern (bug fixed on that -- thanks Andy).
>
> It builds up a document structure based on heading levels, and also
> includes XOXO lists, figures, and "semantic tables" (any table with either
> a summary attribute or <caption> element) in the structure.
>
> It will also scrape non-microformat metadata from:
>
> * HTTP headers
> * TITLE element
> * META elements
> * LINK elements
> * eRDF <http://research.talis.com/2005/erdf/wiki/Main/RdfInHtml>
> * Role <http://www.w3.org/TR/xhtml-role/>
>
> and understands metadata namespaces introduced through RFC 2731 compliant
> LINK elements, and xmlns:FOO attributes.
>
> If you have Perl on your system, I recommend downloading the Perl script
> and installing the needed modules from CPAN. The script is tested and
> working on Linux and Mac OS X. If you don't have Perl, then the web
> interface should give you a good idea of its parsing results.
>
> I welcome feedback (either on this list, or directly to my e-mail address)
> on any problems you find with its parsing. I especially appreciate your
> feedback supplied as a patch!
>
> Also any ideas for future direction are welcome.
>