[uf-discuss] Re: Perl microformat parsing
Tatsuhiko Miyagawa
miyagawa at gmail.com
Sat Mar 1 07:26:05 PST 2008
On Sat, Feb 23, 2008 at 11:50 PM, Toby A Inkster <mail at tobyinkster.co.uk> wrote:
> > Web::Scraper
> > http://search.cpan.org/dist/Web-Scraper/
>
> This looks like a handy module for some purposes, but it's not sufficient
> for fully parsing microformats:
>
> 1. It doesn't seem to support the rel attribute on links, so it will fall
> down when looking for rel-tag (which is used for encoding categories in
> hcard).
process q(*[rel~="me"]), "urls[]" => '@href';
> 2. For images, the alt text should normally be returned (except for a few
> properties like photo and logo in hcard), but this module doesn't read alt
> text. [ It seems that my code kicks Operator's ass in this dept ;-) ]
process 'img', text => '@alt';
> Probably other reasons too, but I can't be bothered to think them all
> through. These three ought to be enough to deter people from using it for
> serious parsing though.
The examples shown in the previous email were quick ones. We can use
fully-fledged XPath or the DOM API to do the heavy lifting.
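For illustration, here is a minimal sketch of how the one-liners above might fit into a complete scraper, mixing a CSS selector, an XPath expression and a code ref. The sample markup and the key names (urls, tags, alts) are made up for this example; it assumes Web::Scraper is installed and that HTML passed as a plain string needs no base URI.

```perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical sample markup, invented for this sketch.
my $html = <<'HTML';
<div class="vcard">
  <a class="fn url" rel="me" href="http://example.org/">Example Person</a>
  <a rel="tag" href="http://example.org/tags/perl">perl</a>
  <a rel="nofollow tag" href="http://example.org/tags/microformats">microformats</a>
  <img class="photo" src="me.png" alt="Portrait">
</div>
HTML

my $hcard = scraper {
    # CSS selector: ~= matches one token of a space-separated attribute value.
    process '*[rel~="me"]', 'urls[]' => '@href';

    # XPath (an expression starting with "/"): the same token match, spelled out.
    process '//a[contains(concat(" ", normalize-space(@rel), " "), " tag ")]',
        'tags[]' => 'TEXT';

    # A code ref receives the matched HTML::Element, so the DOM API is available.
    process 'img', 'alts[]' => sub { my $elem = shift; $elem->attr('alt') };
};

my $res = $hcard->scrape($html);

print "url: $_\n" for @{ $res->{urls} || [] };
print "tag: $_\n" for @{ $res->{tags} || [] };
print "alt: $_\n" for @{ $res->{alts} || [] };
```

The XPath form is more verbose than the CSS one, but it makes the whitespace handling of rel values explicit, which is the part that matters for rel-tag.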
> To parse microformats properly you need DOM, or something of similar
> sophistication. For what it's worth, for alpha3 of my code I've switched
> to XML::LibXML, which is an alternative to XML::DOM. It copes much better
> with parsing random HTML off the web, and has namespace support should I
> decide I need it for anything.
The module uses HTML::TreeBuilder, which is a slick module that gives you
DOM access to web pages. In the latest svn repository we can also use
XML::LibXML internally, with "relaxed" options.
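As a rough idea of what lenient parsing with XML::LibXML looks like on its own, here is a small sketch. It is not Web::Scraper's internal code; it assumes that "relaxed" corresponds to libxml2's recover mode, and the tag-soup input is made up.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical tag-soup input: unclosed <p> elements and an unquoted attribute.
my $html = '<html><body><p>Hello <img src=me.png alt="Portrait"><p>Bye</body></html>';

# recover => 1 keeps parsing past markup errors;
# the suppress_* options silence the parser's complaints.
my $doc = XML::LibXML->load_html(
    string            => $html,
    recover           => 1,
    suppress_errors   => 1,
    suppress_warnings => 1,
);

# The result is a regular DOM document, so XPath queries work as usual.
my @alts = map { $_->value } $doc->findnodes('//img/@alt');
print "alt: $_\n" for @alts;
```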
--
Tatsuhiko Miyagawa