[uf-discuss] Re: Perl microformat parsing

Sat Mar 1 07:26:05 PST 2008

On Sat, Feb 23, 2008 at 11:50 PM, Toby A Inkster <mail at tobyinkster.co.uk> wrote:
>  > Web::Scraper
>  > http://search.cpan.org/dist/Web-Scraper/
>
>  This looks like a handy module for some purposes, but it's not sufficient
>  for fully parsing microformats:
>
>  1. It doesn't seem to support the rel attribute on links, so it will fall
>  down when looking for rel-tag (which is used for encoding categories in
>  hcard).

  process q(*[rel~="me"]), "urls[]" => '@href';
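
For rel-tag specifically, the same attribute selector can be pointed at
"tag". What follows is only a rough sketch: the HTML fragment and the
"tags" and "urls" keys are made up for illustration, and it assumes the
scraper is handed the markup as a scalar reference.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical hCard fragment, made up for illustration.
my $html = <<'HTML';
<div class="vcard">
  <a class="fn url" rel="me" href="http://example.com/">Example Person</a>
  <a rel="tag" href="http://example.com/tags/perl">perl</a>
  <a rel="tag nofollow" href="http://example.com/tags/uf">microformats</a>
</div>
HTML

my $s = scraper {
    # ~= matches one whitespace-separated token of the rel attribute,
    # so rel="tag nofollow" is picked up as well.
    process 'a[rel~="tag"]', 'tags[]' => 'TEXT';
    process '*[rel~="me"]',  'urls[]' => '@href';
};

# Passing a scalar reference scrapes the string instead of fetching a URL.
my $res = $s->scrape(\$html);

print "tag: $_\n" for @{ $res->{tags} || [] };
print "url: $_\n" for @{ $res->{urls} || [] };
```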

>  2. For images, the alt text should normally be returned (except for a few
>  properties like photo and logo in hcard), but this module doesn't read alt
>  text. [ It seems that my code kicks Operator's ass in this dept ;-) ]

  process 'img', text => '@alt';
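
The photo/logo exception could probably be handled with a callback
instead of a fixed '@alt'. A minimal sketch, assuming the property name
sits in the img element's own class attribute; the markup and the
"values" key are hypothetical, and this is a simplification rather than
a full hcard parser.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical markup, made up for illustration.
my $html = <<'HTML';
<div class="vcard">
  <img class="photo" src="http://example.com/me.png" alt="Portrait">
  <img class="fn" src="http://example.com/name.png" alt="Example Person">
</div>
HTML

my $s = scraper {
    # A code reference receives the matched HTML::Element.
    process 'img', 'values[]' => sub {
        my $el    = shift;
        my $class = $el->attr('class') || '';
        # photo and logo carry a URL; everything else falls back to alt text.
        return $class =~ /(?:^|\s)(?:photo|logo)(?:\s|$)/
            ? $el->attr('src')
            : $el->attr('alt');
    };
};

my $res = $s->scrape(\$html);
print "$_\n" for @{ $res->{values} || [] };
```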

>  Probably other reasons too, but I can't be bothered to think them all
>  through. These three ought to be enough to deter people from using it for
>  serious parsing though.

The examples in my previous email were quick ones. We can use a more
fully fledged XPath or DOM API to do the heavy lifting.

>  To parse microformats properly you need DOM, or something of similar
>  sophistication. For what it's worth, for alpha3 of my code I've switched
>  to XML::LibXML, which is an alternative to XML::DOM. It copes much better
>  with parsing random HTML off the web, and has namespace support should I
>  decide I need it for anything.

The module uses HTML::TreeBuilder, which is a slick module that gives you
DOM access to web pages. In the latest svn repository the internals can
also use XML::LibXML with "relaxed" options.
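
For reference, lenient HTML parsing with XML::LibXML used directly might
look like the sketch below. It is not the module's internal code: it
assumes "relaxed" roughly corresponds to the parser's recover mode, and
the tag-soup input is made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical tag soup: unclosed <p>, no doctype.
my $html = <<'HTML';
<div class="vcard"><p>
  <a rel="tag" href="/tags/perl">perl</a>
  <a rel="nofollow tag" href="/tags/uf">microformats</a>
</div>
HTML

my $parser = XML::LibXML->new;
# Assumption: recover mode stands in for the "relaxed" options.
# It keeps parsing malformed markup and suppresses the warnings.
$parser->recover_silently(1);
my $doc = $parser->parse_html_string($html);

# XPath 1.0 idiom for matching one token in a space-separated attribute.
my $xpath = '//a[contains(concat(" ", normalize-space(@rel), " "), " tag ")]';
my @tags  = map { $_->textContent } $doc->findnodes($xpath);

print "$_\n" for @tags;
```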

--
Tatsuhiko Miyagawa