[uf-discuss] Re: Perl microformat parsing
Toby A Inkster
mail at tobyinkster.co.uk
Thu Feb 21 02:14:00 PST 2008
Rob Manson wrote:
> Here's a patch to prove that this is the problem using a quick and dirty
> regex fix:
>
> 848d847
> < $html =~ s/\&nbsp\;//igm;
>
> I tried it on both a simple hcard like
> http://microformats.org/wiki/User:RobManson and the full hcard page
> (which is veeeeery slow to parse) http://microformats.org/wiki/hcard and
> the patch fixes it.
Thanks for your hint. The XML::Parser module is able to fetch DTDs and use
them, so should be able to handle expansion of named entities by itself --
the only problem was that I had disabled it, partly to cut down on
bandwidth usage, but also because I thought it would break too many pages
to validate them. Anyway, I've re-enabled it and this seems to have fixed
more pages than it's broken. I'm guessing that XML::Parser does not
validate against the DTD -- it just uses it to expand entities.
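A minimal sketch of that behaviour, not the code from the cognition script
itself: the document below declares nbsp in an internal DTD subset (an
assumption made so the example runs offline) and contains no element
declarations at all, yet XML::Parser expands the entity and parses the
document without complaint. For entities that live in an external DTD,
XML::Parser documents a ParseParamEnt option that allows the external DTD to
be read; whether the script relies on it is not stated here.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical input. The entity is declared in an internal DTD subset so
# that no network fetch is needed; a real XHTML page would get it from the
# external DTD instead.
my $xml = <<'XML';
<?xml version="1.0"?>
<!DOCTYPE p [
  <!ENTITY nbsp "&#160;">
]>
<p>Rob&nbsp;Manson</p>
XML

# Collect character data; expanded entity text arrives through this handler.
my $text = '';
my $parser = XML::Parser->new(
    Handlers => {
        Char => sub { my ($expat, $str) = @_; $text .= $str; },
    },
);

# parse() dies on malformed XML or an undeclared entity, but it does not
# check the document against the DTD's content model.
$parser->parse($xml);

print "entity expanded: ", ($text =~ /&nbsp;/ ? "no" : "yes"), "\n";
```

If the ENTITY declaration is removed, parse() should die with an undefined
entity error, which is the failure the regex patch was working around.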
With regard to speed, that's because I'm using LWP::RobotUA instead of
LWP::UserAgent. LWP::RobotUA downloads robots.txt (and honours it) and also
enforces a delay between requests. The delay is 1 minute by default. I
meant to set it to 10 seconds, but I was trying to set it in the
LWP::RobotUA constructor, which it seems does not work. The delay is now
set to 5 seconds and works, which has made parsing significantly faster.
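As a rough sketch of setting the delay after construction rather than in the
constructor (the agent name and contact address are placeholders, and this
is not taken from the script): LWP::RobotUA provides a delay() accessor
whose value is in minutes, so 5 seconds is written as 5/60.

```perl
use strict;
use warnings;
use LWP::RobotUA;

# Placeholder agent name and contact address; LWP::RobotUA requires both.
my $ua = LWP::RobotUA->new('example-bot/0.1', 'bot@example.com');

# delay() is measured in minutes, not seconds.
$ua->delay(5 / 60);

printf "delay: %.1f seconds\n", $ua->delay * 60;
```

No request is made here, so nothing is fetched; the delay only takes effect
once the agent starts requesting pages.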
New version (0.1-alpha2.1):
Online: http://buzzword.org.uk/cognition/cognition-0.1-alpha2.1.pl
Download: http://buzzword.org.uk/cognition/cognition-0.1-alpha2.1.txt
This successfully parses both the pages you mentioned above.
Thanks again,
--
Toby A Inkster BSc (Hons) ARCS
[Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
[OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 22 days, 16:20.]
Bottled Water
http://tobyinkster.co.uk/blog/2008/02/18/bottled-water/