[uf-discuss] Re: Perl microformat parsing
Takatsugu Shigeta
takatsugu.shigeta at gmail.com
Fri Feb 22 10:39:30 PST 2008
Hi Toby,
If all you need is to scrape web pages,
I would recommend the following CPAN module:
Web::Scraper
http://search.cpan.org/dist/Web-Scraper/
== sample code ==
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use URI;
use Web::Scraper;
my $url = 'http://diveintomark.org/projects/greasemonkey/hcard/tests/2-4-2-vcard.xhtml';
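# Collect the text of every matching element into an array per field.
my $fn = scraper {
    process '.vcard .fn', 'fn[]' => 'TEXT';
    process '.vcard .tel', 'tel[]' => 'TEXT';
    process '.vcard .title', 'title[]' => 'TEXT';
    # Return only these keys in the result hash.
    result 'fn', 'tel', 'title';
}->scrape(URI->new($url));
# Dump the scraped hash reference.
print Dumper $fn;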
== sample output ==
$ perl hcard.pl
$VAR1 = {
          'tel' => [
                     '+1-919-555-7878'
                   ],
          'title' => [
                       'Area Administrator, Assistant'
                     ],
          'fn' => [
                    'Joe Friday'
                  ]
        };
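
In case it is useful, here is a minimal sketch of a variant that keeps the fields of each card together by nesting one scraper inside another. It scrapes an inline HTML string instead of a URL so it can be tried offline. The HTML snippet and the names $card, $page and 'cards' are my own placeholders, not part of the test page above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Web::Scraper;

# Hypothetical inline hCard, standing in for a fetched page.
my $html = <<'HTML';
<html><body>
<div class="vcard">
<span class="fn">Joe Friday</span>
<span class="tel">+1-919-555-7878</span>
<span class="title">Area Administrator, Assistant</span>
</div>
</body></html>
HTML

# Inner scraper: runs once per .vcard element and returns one hash.
my $card = scraper {
    process '.fn',    fn    => 'TEXT';
    process '.tel',   tel   => 'TEXT';
    process '.title', title => 'TEXT';
};

# Outer scraper: collects one inner result per card into 'cards'.
my $page = scraper {
    process '.vcard', 'cards[]' => $card;
};

# scrape() also accepts a string of HTML instead of a URI object.
my $res = $page->scrape($html);
print Dumper $res;
```

With several cards on one page, this should give one hash per card rather than three unrelated arrays.
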
Thanks.
-- shigeta
On Thu, Feb 21, 2008 at 7:14 PM, Toby A Inkster <mail at tobyinkster.co.uk> wrote:
> Rob Manson wrote:
>
> > Here's a patch to prove that this is the problem using a quick and dirty
> > regex fix:
> >
> > 848d847
> > < $html =~ s/\&nbsp\;//igm;
> >
> > I tried it on both a simple hcard like
> > http://microformats.org/wiki/User:RobManson and the full hcard page
> > (which is veeeeery slow to parse) http://microformats.org/wiki/hcard and
> > the patch fixes it.
>
> Thanks for your hint. The XML::Parser module is able to fetch DTDs and use
> them, so should be able to handle expansion of named entities by itself --
> the only problem was that I had disabled it, partly to cut down on
> bandwidth usage, but also because I thought it would break too many pages
> to validate them. Anyway, I've re-enabled it and this seems to have fixed
> more pages than it's broken. I'm guessing that XML::Parser does not
> validate based on the DTD -- it just uses them to expand entities.
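
A small sketch of the entity behaviour described above, as far as I understand it. To stay offline it declares the nbsp entity in an internal DTD subset instead of fetching the XHTML DTD, so it only illustrates the expansion itself, not the DTD download. The sample documents and variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Without any declaration, &nbsp; is an undefined entity and the parse dies.
my $ok = eval { XML::Parser->new->parse('<p>Joe&nbsp;Friday</p>'); 1 };
my $error = $ok ? '' : $@;

# Hypothetical document that declares the entity in its internal subset.
my $xml = <<'XML';
<?xml version="1.0"?>
<!DOCTYPE p [
  <!ENTITY nbsp "&#160;">
]>
<p>Joe&nbsp;Friday</p>
XML

# Accumulate character data; the handler may be called several times.
my $text = '';
my $parser = XML::Parser->new(
    Handlers => {
        Char => sub { my ($expat, $str) = @_; $text .= $str; },
    },
);
$parser->parse($xml);

# Make the non-breaking space visible when printing.
(my $shown = $text) =~ s/\x{A0}/[NBSP]/g;
print "without a declaration: ", ($error ? "parse failed" : "parsed"), "\n";
print "with a declaration: $shown\n";
```

Once the entity is declared, the parser expands it by itself, so the regex workaround should no longer be needed.
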
>
> With regard to speed, that's because I'm using LWP::RobotUA instead of
> LWP::UserAgent. This downloads the robots.txt (and honours it) and also
> enforces a delay between each request. The delay is 1 minute by default
> though I set it to 10 seconds -- or at least I thought I did, but I was
> trying to set it in the LWP::RobotUA constructor function, which it seems
> does not work. The delay is now set to 5 seconds and works. This has made
> it significantly faster.
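
For reference, a minimal sketch of setting the delay through the delay() method after construction. The value is given in minutes, so 5 seconds is 5/60. The agent name and contact address are placeholders of mine, and no request is sent here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;

# Agent name and contact address are placeholders; both are required.
my $ua = LWP::RobotUA->new('example-bot/0.1', 'me@example.org');

# delay() takes minutes and need not be an integer: 5/60 is 5 seconds.
$ua->delay(5 / 60);

printf "delay between requests: %.1f seconds\n", $ua->delay * 60;
```
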
>
> New version (0.1-alpha2.1):
>
> Online: http://buzzword.org.uk/cognition/cognition-0.1-alpha2.1.pl
> Download: http://buzzword.org.uk/cognition/cognition-0.1-alpha2.1.txt
>
> This successfully parses both the pages you mentioned above.
>
> Thanks again,
>
>
> --
> Toby A Inkster BSc (Hons) ARCS
> [Geek of HTML/SQL/Perl/PHP/Python/Apache/Linux]
> [OS: Linux 2.6.17.14-mm-desktop-9mdvsmp, up 22 days, 16:20.]
>
>
> Bottled Water
> http://tobyinkster.co.uk/blog/2008/02/18/bottled-water/
>