[uf-discuss] Re: Perl microformat parsing

Thu Feb 21 02:14:00 PST 2008

Rob Manson wrote:

> Here's a patch to prove that this is the problem using a quick and dirty
> regex fix:
> 
> 848d847
> <       $html =~ s/\&nbsp\;//igm;
> 
> I tried it on both a simple hcard like
> http://microformats.org/wiki/User:RobManson and the full hcard page
> (which is veeeeery slow to parse) http://microformats.org/wiki/hcard and
> the patch fixes it.

Thanks for your hint. The XML::Parser module is able to fetch DTDs and use 
them, so should be able to handle expansion of named entities by itself -- 
the only problem was that I had disabled it, partly to cut down on 
bandwidth usage, but also because I thought it would break too many pages 
to validate them. Anyway, I've re-enabled it and this seems to have fixed 
more pages than it's broken. I'm guessing that XML::Parser does not 
validate against the DTDs -- it just uses them to expand entities.
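
To illustrate the point about entity expansion without validation, here is a minimal sketch. It is not the code from the script itself: the sample document is made up, and it declares the nbsp entity in an internal DTD subset so that nothing has to be fetched over the network. The elements are deliberately left undeclared, and the parser accepts them anyway.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample document: the internal DTD subset declares only the
# nbsp entity; none of the elements are declared at all.
my $xhtml = <<'XML';
<!DOCTYPE html [ <!ENTITY nbsp "&#160;"> ]>
<html><body><p class="fn">Rob&nbsp;Manson</p></body></html>
XML

# Collect all character data; the parser hands the Char handler the
# already-expanded entity text.
my $text = '';
my $parser = XML::Parser->new(
    Handlers => {
        Char => sub { my ($expat, $chars) = @_; $text .= $chars; },
    },
);

# Dies on a well-formedness error, but not on undeclared elements.
$parser->parse($xhtml);

# Report the code point that replaced &nbsp; (the 4th character).
printf "Character after 'Rob': U+%04X\n", ord(substr($text, 3, 1));
```

Without the entity declaration, the same document would be rejected as not well-formed because of the undefined entity, which is the failure the regex patch was working around.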

With regard to speed, that's because I'm using LWP::RobotUA instead of 
LWP::UserAgent. This downloads the robots.txt (and honours it) and also 
enforces a delay between each request. The delay is 1 minute by default 
though I set it to 10 seconds -- or at least I thought I did, but I was 
trying to set it in the LWP::RobotUA constructor function, which it seems 
does not work. The delay is now set to 5 seconds and works. This has made 
it significantly faster.
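
For anyone hitting the same thing, a minimal sketch of setting the delay through the delay() method rather than the constructor. The agent name and contact address are placeholders, not the values the script uses. Note that delay() takes minutes, so 5 seconds has to be written as a fraction.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Placeholder agent name and contact address (both are required arguments).
my $ua = LWP::RobotUA->new('example-bot/0.1', 'webmaster@example.org');

# delay() is expressed in minutes; the default is 1. Five seconds is 5/60.
$ua->delay(5 / 60);

printf "Minimum delay between requests: %.0f seconds\n", $ua->delay * 60;
```

No request is made here, so this runs offline; the delay only takes effect on subsequent requests to the same server.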

New version (0.1-alpha2.1):

Online:   http://buzzword.org.uk/cognition/cognition-0.1-alpha2.1.pl
Download: http://buzzword.org.uk/cognition/cognition-0.1-alpha2.1.txt

This successfully parses both the pages you mentioned above.

Thanks again,

-- 
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 22 days, 16:20.]

                               Bottled Water
          http://tobyinkster.co.uk/blog/2008/02/18/bottled-water/